Foreword


Instructors,  formal  and  informal  learners,  working  professionals,  and  readers  looking 
to  enhance,  update,  or  refresh  their  interactive  data  skills  and  methodological 
developments  may  selectively  choose  sections,  chapters,  and  examples  they  want 
to  cover  in  more  depth.  Everyone  who  expects  to  gain  new  knowledge  or  acquire 
computational  abilities  should  review  the  overall  textbook  organization  before  they 
decide  what  to  cover,  how  deeply,  and  in  what  order.  The  organization  of  the 
chapters  in  this  book  reflects  an  order  that  may  appeal  to  many,  albeit  not  all,  readers. 

Chapter 1 (Motivation) (1) presents the DSPA mission and objectives; (2) describes several
driving biomedical challenges, including Alzheimer’s disease, Parkinson’s disease, drug
and substance use, and amyotrophic lateral sclerosis; (3) provides demonstrations of
brain visualization, neurodegeneration, and genomics computing; (4) identifies the six
defining characteristics of big (biomedical and healthcare) data; (5) explains the
concepts of data science and predictive analytics; and (6) sets the DSPA expectations.

Chapter 2 (Foundations of R) justifies the use of the statistical programming
language R and (1) presents the fundamental programming principles; (2) illustrates
basic examples of data transformation, generation, ingestion, and export; (3) shows
the main mathematical operators; and (4) presents basic data and probability
distribution summaries and visualization.

In Chap. 3 (Managing Data in R), we (1) present additional R programming details
about loading, manipulating, visualizing, and saving R data structures; (2) present
sample-based statistics measuring central tendency and dispersion; (3) explore
different types of variables; (4) illustrate scraping data from public websites; and
(5) show examples of cohort-rebalancing.

A detailed discussion of Visualization is presented in Chap. 4, where we
(1) show graphical techniques for exposing composition, comparison, and relationships
in multivariate data; and (2) present 1D, 2D, 3D, and 4D distributions along
with surface plots.

The foundations of Linear Algebra and Matrix Computing are shown in
Chap. 5. We (1) show how to create, interpret, process, and manipulate
second-order tensors (matrices); (2) illustrate a variety of matrix operations and their
interpretations; (3) demonstrate linear modeling and solutions of matrix equations;
and (4) discuss the eigen-spectra of matrices.

Chapter 6 (Dimensionality Reduction) starts with a simple example reducing
2D data to a 1D signal. We also discuss (1) matrix rotations, (2) principal component
analysis (PCA), (3) singular value decomposition (SVD), (4) independent component
analysis (ICA), and (5) factor analysis (FA).

The discussion of machine learning model-based and model-free techniques
commences in Chap. 7 (Lazy Learning - Classification Using Nearest Neighbors).
In the scope of the k-nearest neighbor algorithm, we present (1) the general
concept of divide-and-conquer for splitting the data into training and validation sets,
(2) evaluation of model performance, and (3) improvement of prediction results.

Chapter 8 (Probabilistic Learning: Classification Using Naive Bayes) presents
the naive Bayes and linear discriminant analysis classification algorithms,
identifies the assumptions of each method, introduces the Laplace estimator, and
demonstrates step by step the complete protocol for training, testing, validating,
and improving the classification results.

Chapter  9  (Decision  Tree  Divide  and  Conquer  Classification)  focuses  on 
decision  trees  and  (1)  presents  various  classification  metrics  (e.g.,  entropy, 
misclassification  error,  Gini  index),  (2)  illustrates  the  use  of  the  C5.0  decision  tree 
algorithm,  and  (3)  shows  strategies  for  pruning  decision  trees. 

The use of linear prediction models is highlighted in Chap. 10 (Forecasting
Numeric Data Using Regression Models). Here, we (1) present the fundamentals
of multivariate linear modeling, (2) contrast regression trees vs. model trees, and
(3) present several complete end-to-end predictive analytics examples.

Chapter 11 (Black Box Machine-Learning Methods: Neural Networks and
Support Vector Machines) lays out the foundation of Neural Networks as silicon
analogues to biological neurons. We (1) discuss the effects of network layers and
topology on the resulting classification, (2) present support vector machines (SVM),
and (3) demonstrate classification methods for optical character recognition (OCR),
iris flowers clustering, Google trends and the stock market prediction, and
quantifying quality of life in chronic disease.

Apriori Association Rules Learning is presented in Chap. 12, where we discuss
(1) the foundation of association rules and the Apriori algorithm, (2) support and
confidence measures, and (3) several examples based on grocery shopping
and head and neck cancer treatment.

Chapter 13 (k-Means Clustering) presents (1) the basics of machine learning
clustering tasks, (2) silhouette plots, (3) strategies for model tuning and
improvement, (4) hierarchical clustering, and (5) Gaussian mixture modeling.

General protocols for measuring the performance of different types of classification
methods are presented in Chap. 14 (Model Performance Assessment). We
(1) discuss evaluation strategies for binary, categorical, and continuous outcomes;
(2) present confusion matrices quantifying classification and prediction accuracy;
(3) visualize algorithm performance and ROC curves; and (4) introduce the foundations
of internal statistical validation.

Chapter 15 (Improving Model Performance) demonstrates (1) strategies for
manual and automated model tuning, (2) improving model performance with
meta-learning, and (3) ensemble methods based on bagging, boosting, random forest, and
adaptive boosting.

Chapter 16 (Specialized Machine Learning Topics) presents some technical
details that may be useful for some computational scientists and engineers. There, we
discuss (1) data format conversion; (2) SQL data queries; (3) reading and writing
XML, JSON, XLSX, and other data formats; (4) visualization of network
bioinformatics data; (5) data streaming and on-the-fly stream classification and clustering;
(6) optimization and improvement of computational performance; and (7) parallel
computing.

The classical approaches for feature selection are presented in Chap. 17
(Variable/Feature Selection), where we (1) discuss filtering, wrapper, and embedded
techniques, and (2) show the entire protocol, from data collection and preparation to
model training, testing, evaluation, and comparison, using recursive feature
elimination.

In Chap. 18 (Regularized Linear Modeling and Controlled Variable Selection),
we extend the mathematical foundation we presented in Chap. 5 to include
fidelity and regularization terms in the objective function used for model-based
inference. Specifically, we discuss (1) computational protocols for handling complex
high-dimensional data, (2) model estimation by controlling the false-positive rate of
selection of critical features, and (3) derivations of effective forecasting models.

Chapter 19 (Big Longitudinal Data Analysis) is focused on interrogating
time-varying observations. We illustrate (1) time series analysis, e.g., ARIMA
modeling, (2) structural equation modeling (SEM) with latent variables,
(3) longitudinal data analysis using linear mixed models, and (4) generalized estimating
equations (GEE) modeling.

Expanding upon the term-frequency and inverse document frequency techniques
we saw in Chap. 8, Chap. 20 (Natural Language Processing/Text Mining) provides
more details about (1) handling unstructured text documents, (2) term
frequency (TF) and inverse document frequency (IDF), and (3) the cosine similarity
measure.

Chapter 21 (Prediction and Internal Statistical Cross Validation) provides a
broader and deeper discussion of method validation, which started in Chap. 14.
Here, we (1) present general prediction and forecasting methods, (2) demonstrate
internal statistical n-fold cross-validation, and (3) discuss comparison strategies for
multiple prediction models.

Chapter 22 (Function Optimization) presents technical details about minimizing
objective functions, which are present in virtually any data science oriented
inference or evidence-based translational study. Here, we explain (1) constrained
and unconstrained cost function optimization, (2) Lagrange multipliers, (3) linear
and quadratic programming, (4) general nonlinear optimization, and (5) data
denoising.

The last chapter of this textbook is Chap. 23 (Deep Learning). It covers
(1) perceptron activation functions, (2) relations between artificial and biological
neurons and networks, (3) neural nets for computing exclusive OR (XOR) and
negative AND (NAND) operators, (4) classification of handwritten digits, and
(5) classification of natural images.

We compiled a few dozen biomedical and healthcare case-studies that are
used to demonstrate the presented DSPA concepts, apply the methods, and validate
the software tools. For example, Chap. 1 includes high-level driving biomedical
challenges including dementia and other neurodegenerative diseases, substance use,
neuroimaging, and forensic genetics. Chapter 3 includes a traumatic brain injury
(TBI) case-study, Chap. 10 describes a heart attack case-study, and Chap. 11 uses
a quality-of-life in chronic disease dataset to demonstrate optical character recognition
that can be applied to automatic reading of handwritten physician notes. Chapter 18
presents a predictive analytics Parkinson’s disease study using neuroimaging-genetics
data. Chapter 20 illustrates the applications of natural language processing
to extract quantitative biomarkers from unstructured text, which can be used to study
hospital admissions, medical claims, or patient satisfaction. Chapter 23 shows
examples of predicting clinical outcomes for amyotrophic lateral sclerosis and
irritable bowel syndrome cohorts, as well as quantitative and qualitative
classification of biological images and volumes. These represent just a few examples,
and readers are encouraged to try the same methods, protocols, and analytics on
other research-derived, clinically acquired, aggregated, secondary-use, or simulated
datasets.

The online appendices (http://DSPA.predictive.space) are continuously expanded
to provide more details and additional content, and to broaden the scope of the DSPA
methods and applications. Throughout this textbook, there are cross-references to
appropriate chapters, sections, datasets, web services, and live demonstrations (Live
Demos). The sequential arrangement of the chapters provides a suggested reading
order; however, alternative sorting and pathways covering parts of the materials are
also provided. Of course, readers and instructors may further choose their own
coverage paths based on specific intellectual interests and project needs.

Genesis

Since the turn of the twenty-first century, the evidence overwhelmingly reveals that the
amount of data we collect doubles every 12-14 months (Kryder’s law). The growth
momentum of the volume and complexity of digital
information we gather far outpaces the corresponding increase of computational
power, which doubles every 18 months (Moore’s law). There is a substantial
imbalance between the increase of data inflow and the corresponding computational
infrastructure intended to process that data. This calls into question our ability to
extract valuable information and actionable knowledge from the mountains of digital
information we collect. Nowadays, it is very common for researchers to work with
petabytes (PB) of data, 1 PB = 10^15 bytes, which may include nonhomologous
records that demand unconventional analytics. For comparison, the Milky Way
Galaxy has approximately 2 x 10^11 stars. If each star represents a byte, then one
petabyte of data corresponds to 5,000 Milky Way Galaxies.
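
The doubling rates and the petabyte comparison quoted above can be checked with a few lines of R. This is only a back-of-the-envelope sketch: the 72-month horizon and all variable names are illustrative assumptions, not values taken from the text.

```r
# Back-of-the-envelope check of the figures quoted above.
# The 72-month horizon is an arbitrary illustration, not a value from the text.
months <- 72

# Growth factor after `months` when a quantity doubles every `period` months
growth_factor <- function(months, period) 2^(months / period)

data_fast <- growth_factor(months, 12)  # data volume doubling every 12 months
data_slow <- growth_factor(months, 14)  # data volume doubling every 14 months
compute   <- growth_factor(months, 18)  # computational power doubling every 18 months

# Petabyte versus Milky Way comparison: one byte per star
petabyte_bytes  <- 1e15
milky_way_stars <- 2e11
galaxies_per_pb <- petabyte_bytes / milky_way_stars

print(c(data_fast = data_fast, data_slow = data_slow, compute = compute))
print(galaxies_per_pb)
```

Under these assumptions, six years of growth multiplies the data volume by a factor between roughly 35 and 64, while computational power grows 16-fold, and the ratio of bytes to stars reproduces the 5,000 galaxies mentioned above.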

This data storage-computing asymmetry leads to an explosion of innovative data
science methods and disruptive computational technologies that show promise to
provide effective (semi-intelligent) decision support systems. Designing,
understanding, and validating such new techniques require deep within-discipline basic
science knowledge, transdisciplinary team-based scientific collaboration,
open-scientific endeavors, and a blend of exploratory and confirmatory scientific
discovery. There is a pressing demand to bridge the widening gaps between the needs and
skills of practicing data scientists, advanced techniques introduced by theoreticians,
algorithms invented by computational scientists, models constructed by biosocial
investigators, and network products and Internet of Things (IoT) services engineered by
software architects.

Purpose

The purpose of this book is to provide a sufficient methodological foundation for a
number of modern data science techniques along with hands-on demonstration of
implementation protocols, pragmatic mechanics of protocol execution, and
interpretation of the results of these methods applied on concrete case-studies. Successfully
completing the Data Science and Predictive Analytics (DSPA) training materials
(http://predictive.space) will equip readers to (1) understand the computational
foundations of Big Data Science; (2) build critical inferential thinking; (3) employ a
tool chest of R libraries for managing and interrogating raw, derived, observed,
experimental, and simulated big healthcare datasets; and (4) develop practical skills
for handling complex datasets.


Limitations/Prerequisites 

Prior to diving into DSPA, readers are strongly encouraged to review the
prerequisites and complete the self-assessment pretest. Sufficient remediation
materials are provided or referenced throughout. The DSPA materials may be used for
a variety of graduate level courses with durations of 10-30 weeks, with 3-4
instructional credit hours per week. Instructors can refactor and present the materials in
alternative orders. The DSPA chapters in this book are organized sequentially.
However, the content can be tailored to fit the audience’s needs. Learning data
science and predictive analytics is not a linear process - many alternative pathways
can be completed to gain complementary competencies. We developed an
interactive and dynamic flowchart
(http://socr.umich.edu/people/dinov/courses/DSPA_Book_FlowChart.html)
that highlights several tracks illustrating reasonable pathways
starting with Foundations of R and ending with specific competency topics.
The content of this book may also be used for self-paced learning or as a refresher for
working professionals, as well as for formal and informal data science training,
including massive open online courses (MOOCs). The DSPA materials are designed
to build specific data science skills and predictive analytic competencies, as
described by the Michigan Institute for Data Science (MIDAS).


Scope  of  the  Book 

Throughout  this  book,  we  use  a  constructive  definition  of  “Big  Data”  derived  by 
examining  the  common  characteristics  of  many  dozens  of  biomedical  and  healthcare 
case-studies,  involving  complex  datasets  that  required  special  handling,  advanced 
processing,  contemporary  analytics,  interactive  visualization  tools,  and  translational 
interpretation. These six characteristics of “Big Data” are defined in the Motivation
Chapter as size, heterogeneity and complexity, representation incongruency,
incompleteness, multiscale format, and multisource origins. All supporting electronic
materials,  including  datasets,  assessment  problems,  code,  software  tools,  videos, 
and  appendices,  are  available  online  at  http://DSPA.predictive.space. 

This textbook presents a balanced view of the mathematical formulation,
computational implementation, and health applications of modern techniques for
managing, processing, and interrogating big data. The intentional focus on human health
applications is demonstrated by a diverse range of biomedical and healthcare
case-studies. However, the same techniques could be applied in other domains, e.g.,
climate and environmental sciences, biosocial sciences, high-energy physics,
astronomy, etc., that deal with complex data possessing the above characteristics. Another
specific feature of this book is that it solely utilizes the statistical computing
language R, rather than any other scripting, user-interface based, or software
programming alternatives. The choice for R is justified in the Foundations Chapter.

All  techniques  presented  here  aim  to  obtain  data-driven  and  evidence-based 
scientific  inference.  This  process  starts  with  collecting  or  retrieving  an  appropriate 
dataset,  and  identifying  sources  of  data  that  need  to  be  harmonized  and  aggregated 
into  a  joint  computable  data  object.  Next,  the  data  are  typically  split  into  training  and 
testing  components.  Model-based  or  model-free  methods  are  fit,  estimated,  or 
learned  on  the  training  component  and  then  validated  on  the  complementary  testing 
data.  Different  types  of  expected  outcomes  and  results  from  this  process  include 
prediction,  prognostication,  or  forecasting  of  specific  clinical  traits  (computable 
phenotypes),  clustering,  or  classification  that  labels  units,  subjects,  or  cases  in  the 
data.  The  final  steps  include  algorithm  fine-tuning,  assessment,  comparison,  and 
statistical  validation. 
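
The split, fit, and validate steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the protocol used in later chapters: the built-in mtcars dataset stands in for a harmonized data object, a linear model stands in for a generic model-based method, and the 70/30 split ratio and the random seed are arbitrary choices.

```r
# Illustrative only: mtcars and the linear model are stand-ins chosen for this
# sketch; neither the dataset nor the 70/30 split comes from the textbook.
set.seed(1234)
data(mtcars)

# 1. Split the cases into training (about 70%) and testing components
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# 2. Fit (estimate) the model on the training component only
fit <- lm(mpg ~ wt + hp, data = train)

# 3. Validate on the complementary testing component
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))  # root mean squared prediction error

print(c(n_train = nrow(train), n_test = nrow(test)))
print(rmse)
```

Keeping the testing cases out of the fitting step is what makes the reported error an estimate of out-of-sample performance; fine-tuning, comparison of alternative models, and statistical validation would build on the same split.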

explicit or implicit indication of FDA approval!

Any  and  all  liability  arising  directly  or  indirectly  from  the  use  of  the  DSPA 
resources  is  hereby  disclaimed.  The  DSPA  resources  are  provided  “as  is”  and 

The following common notations are used throughout this textbook.


Notation: http://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html
Description: A link to an interactive live web demonstration. Some of these Live Demos
require modern Java and JavaScript enabled browsers and Internet access.

Notation:
    require(ggplot2)
    # Comments
    ## Loading required package: ggplot2
    Data_R_SAS_SPSS_Pubs <-
      read.csv('https://umich.edu/data', header=T)
    df <- data.frame(Data_R_SAS_SPSS_Pubs)
    # convert to long format
    df <- melt(df, id.vars = 'Year',
               variable.name = 'Software')
    ggplot(data=df, aes(x=Year, y=value,
           color=Software, group = Software)) +
      geom_line() + geom
    ## 3 la
    ##20 3c
    data_long
    ##   CaseID Gender Feature Measurement
    ## 1      1      M     Age         5.0
    ## 2      2      F     Age         6.0
Description: R fragments of code, reported results in the output shell, or
comments. The complete library of all code presented in the
textbook is available in electronic format on the DSPA site.
Note that:
  - "#" is used for comments,
  - "##" indicates R textual output,
  - the R code is color-coded to identify different types of comments,
    instructions, commands, and parameters,
  - output like "## ... ##" suggests that some of the R output is deleted or
    compressed to save space, and
  - indenting is used to visually determine the scope of a method, command,
    or an expression.

Notation: ← or →
Description: In an asymptotic or limiting sense, tending to, convergence, or
approaching a value or a limit.

Notation: ≪ or ≫
Description: The left-hand side is substantially smaller or larger than the
right-hand side.

Notation: ~
Description: Depending on the context, model definition, similar to,
approximately equal to, or equivalent (in probability distribution sense).

Notation: package::function
Description: A standard reference notation to functions that are members of
specific R packages.

Notation: Case-studies
Description: https://umich.instructure.com/courses/38100/files/folder/Case_Studies

Notation: Electronic Materials
Description: http://DSPA.predictive.space

Also see the Glossary and the Index, located at the end of the book.

Fig. 2 Common DSPA notations
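
As a small illustration of the comment, output, and package::function conventions listed above, consider the following sketch. The vector x is a made-up example; stats and utils are packages that ship with every standard R installation.

```r
# "#" introduces a comment
x <- c(5.0, 6.0, 7.5)

# package::function calls a function from a specific package
# without attaching the whole package first
m <- stats::median(x)
print(m)
## [1] 6

utils::head(x, 2)
## [1] 5 6
```

The lines starting with "##" show the textual output that R prints for the preceding command.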

1.1 DSPA Mission and Objectives

This  textbook  is  based  on  the  Data  Science  and  Predictive  Analytics  (DSPA)  course 
taught  by  the  author  at  the  University  of  Michigan.  These  materials  collectively  aim 
to  provide  learners  with  a  solid  foundation  of  the  challenges,  opportunities,  and 
strategies  for  designing,  collecting,  managing,  processing,  interrogating,  analyzing, 
and interpreting complex health and biomedical datasets. Readers who finish this
textbook and successfully complete the examples and assignments will gain unique
skills  and  acquire  a  tool-chest  of  methods,  software  tools,  and  protocols  that  can  be 
applied  to  a  broad  spectrum  of  Big  Data  problems. 

The  DSPA  textbook  vision,  values,  and  priorities  are  summarized  below: 

•  Vision:  Enable  active  learning  by  integrating  driving  motivational  challenges 
with  mathematical  foundations,  computational  statistics,  and  modern  scientific 
inference. 

• Values: Effective, reliable, reproducible, and transformative data-driven
discovery supporting open science.

• Strategic priorities: Trainees will develop scientific intuition, computational
skills, and data-wrangling abilities to tackle big biomedical and health data
problems. Instructors will provide well-documented R-scripts and software
recipes implementing atomic data filters as well as complex end-to-end predictive
big data analytics solutions.

Before  diving  into  the  mathematical  algorithms,  statistical  computing  methods, 
software  tools,  and  health  analytics  covered  in  the  remaining  chapters,  we  will 
discuss  several  driving  motivational  problems.  These  will  ground  all  the  subsequent 
scientific  discussions,  data  modeling  techniques,  and  computational  approaches. 


©  Ivo  D.  Dinov  2018 

1 Motivation


1.2  Examples  of  Driving  Motivational  Problems 
and  Challenges 

For  each  of  the  studies  below,  we  illustrate  several  clinically  relevant  scientific 
questions,  identify  appropriate  data  sources,  describe  the  types  of  data  elements, 
and  pinpoint  various  complexity  challenges. 


1.2.1 Alzheimer’s Disease

•  Identify  the  relation  between  observed  clinical  phenotypes  and  expected 
behavior. 

•  Prognosticate  future  cognitive  decline  (3-12  months,  prospectively)  as  a  function 
of  imaging  data  and  clinical  assessment  (both  model-based  and  model-free 
machine  learning  prediction  methods  will  be  used). 

• Derive and interpret the classifications of subjects into clusters using the
harmonized and aggregated data from multiple sources (Fig. 1.1).


1.2.2 Parkinson’s Disease

•  Predict  the  clinical  diagnosis  of  patients  using  all  available  data  (with  and  without 
the  unified  Parkinson’s  disease  rating  scale  (UPDRS)  clinical  assessment,  which 
is  the  basis  of  the  clinical  diagnosis  by  a  physician). 

•  Compute  derived  neuroimaging  and  genetics  biomarkers  that  can  be  used  to 
model  the  disease  progression  and  provide  automated  clinical  decisions  support. 

•  Generate  decision  trees  for  numeric  and  categorical  responses  (representing 
clinically  relevant  outcome  variables)  that  can  be  used  to  suggest  an  appropriate 
course  of  treatment  for  specific  clinical  phenotypes  (Fig.  1.2). 


Data Source: ADNI Archive; NACC Archive

Sample Size/Data Type: Clinical data: demographics, clinical assessments, cognitive
assessments; Imaging data: sMRI, fMRI, DTI, PiB/FDG PET;
Genetics data: Illumina SNP genotyping; Chemical
biomarker: lab tests, proteomics. Each data modality comes
with a different number of cohorts. Generally, 200 < N < 1200.
For instance, previously conducted ADNI studies
with N > 500 [doi: 10.3233/JAD-150335, doi: 10.1111/jon.12252,
doi: 10.3389/fninf.2014.00041].

Summary: ADNI provides interesting data modalities, multiple
cohorts (e.g., early-onset, mild, and severe dementia,
controls) that allow effective model training and validation.

Fig. 1.1 Outline of an Alzheimer’s disease case-study

Data Source: PPMI Archive

Sample Size/Data Type: Demographics: age, medical history, sex; Clinical data:
physical, verbal learning and language, neurological and
olfactory (University of Pennsylvania Smell
Identification Test, UPSIT) tests, vital signs, MDS-UPDRS
scores (Movement Disorder Society-Unified Parkinson's
Disease Rating Scale), ADL (activities of daily living),
Montreal Cognitive Assessment (MoCA), Geriatric
Depression Scale (GDS-15); Imaging data: structural
MRI; Genetics data: Illumina ImmunoChip (196,524
variants) and NeuroX (covering 240,000 exonic variants)
with 100% sample success rate, and 98.7% genotype
success rate, genotyped for APOE e2/e3/e4. Three
cohorts of subjects: Group 1 = {de novo PD Subjects
with a diagnosis of PD for two years or less who are not
taking PD medications}, N1 = 263; Group 2 = {PD
Subjects with Scans without Evidence of a Dopaminergic
Deficit (SWEDD)}, N2 = 40; Group 3 = {Control Subjects
without PD who are 30 years or older and who do not
have a first degree blood relative with PD}, N3 = 127.

Summary: The longitudinal PPMI dataset
including clinical, biological, and
imaging data (screening, baseline,
12, 24, and 48 month follow-ups)
may be used to conduct model-based
predictions as well as model-free
classification and forecasting
analyses.

Fig. 1.2 Outline of a Parkinson’s disease case-study


Data Source: MAWS Data / UMHS EHR / WHO AWS Data

Sample Size/Data Type: Scores from Alcohol Use Disorders
Identification Test-Consumption (AUDIT-C),
including dichotomous variables for
any current alcohol use (AUDIT-C,
question 1), total AUDIT-C score > 8, and
any positive history of alcohol
withdrawal syndrome (HAWS).

Summary: ~1,000 positive cases per year among 10,000
adult medical inpatients, % RAWS screens
completed, % positive screens, % entered
into MAWS protocol who receive
pharmacological treatment for AWS, %
entered into MAWS protocol without a
completed RAWS screen.

Fig. 1.3 Outline of a substance use case-study

1.2.3  Drug  and  Substance  Use 

•  Is  the  Risk  for  Alcohol  Withdrawal  Syndrome  (RAWS)  screen  a  valid  and 
reliable  tool  for  predicting  alcohol  withdrawal  in  an  adult  medical  inpatient 
population? 

• What is the optimal cut-off score from the AUDIT-C to predict alcohol
withdrawal based on RAWS screening?

•  Should  any  items  be  deleted  from,  or  added  to,  the  RAWS  screening  tool  to 
enhance  its  performance  in  predicting  the  emergence  of  alcohol  withdrawal 
syndrome  in  an  adult  medical  inpatient  population?  (Fig.  1.3) 

Data Source: ProAct Archive

Sample Size/Data Type: Over 100 clinical variables are recorded for all
subjects including: Demographics: age, race,
medical history, sex; Clinical data: Amyotrophic
Lateral Sclerosis Functional Rating Scale (ALSFRS),
adverse events, onset_delta, onset_site, drugs use
(riluzole). The PRO-ACT training dataset contains
clinical and lab test information of 8,635 patients.
Information of 2,424 study subjects with valid gold
standard ALSFRS slopes will be used in our
processing, modeling and analysis.

Summary: The time points for all longitudinally
varying data elements will be
aggregated into signature vectors.
This will facilitate the modeling and
prediction of ALSFRS slope changes
over the first three months (baseline
to month 3).

Fig. 1.4 Outline of an amyotrophic lateral sclerosis (Lou Gehrig’s disease) case-study


1.2.4  Amyotrophic  Lateral  Sclerosis 

•  Identify  the  most  highly  significant  variables  that  have  power  to  jointly  predict  the 
progression  of  ALS  (in  terms  of  clinical  outcomes  like  ALSFRS  and  muscle 
function). 

•  Provide  a  decision  tree  prediction  of  adverse  events  based  on  subject  phenotype 
and  0-3-month  clinical  assessment  changes  (Fig.  1.4). 


1.2.5  Normal  Brain  Visualization 

The  SOCR  Brain  Visualization  tool  (http://socr.umich.edu/HTML5/BrainViewer) 
has  preloaded  sMRI,  ROI  labels,  and  fiber  track  models  for  a  normal  brain.  It  also 
allows  users  to  drag  and  drop  their  data  into  the  browser  to  visualize  and  navigate 
through  the  stereotactic  data  (including  imaging,  parcellations,  and  tractography) 
(Fig.  1.5). 


1.2.6  Neurodegeneration 

A recent study of Structural Neuroimaging in Alzheimer’s disease
(https://www.ncbi.nlm.nih.gov/pubmed/26444770) illustrates the Big Data challenges in modeling
complex neuroscientific data. Specifically, 808 ADNI subjects were divided into
3 groups: 200 subjects with Alzheimer’s disease (AD), 383 subjects with mild
cognitive impairment (MCI), and 225 asymptomatic normal controls (NC). Their
sMRI data were parcellated using BrainParser, and the 80 most important neuroimaging
biomarkers were extracted using the global shape analysis pipeline workflow.
Using a pipeline implementation of Plink, the authors obtained 80 SNPs highly
associated with the imaging biomarkers. The authors observed significant
correlations between genetic and neuroimaging phenotypes in the 808 ADNI subjects.
These results suggest that differences between AD, MCI, and NC cohorts may
be examined by using powerful joint models of morphometric, imaging, and genotypic
data (Fig. 1.6).

Fig. 1.5 Interactive 3D brain visualization


1.2.7  Genetic  Forensics:  2013-2016  Ebola  Outbreak 

This Howard Hughes Medical Institute (HHMI) disease detective activity illustrates
the genetic analysis of sequences of Ebola viruses isolated from patients in Sierra
Leone during the Ebola outbreak of 2013-2016. Scientists track the spread of the
virus using the fact that most of the genome is identical among individuals of the
same species, most similar for genetically related individuals, and more different as
the hereditary distance increases. DNA profiling capitalizes on these genetic differences,
particularly in regions of noncoding DNA, which is DNA that is not transcribed
and translated into a protein. Variations in noncoding regions have less
impact on individual traits. Such changes in noncoding regions may be immune to
natural selection. DNA variations called short tandem repeats (STRs) consist
of short sequences, typically 2-5 bases long, that repeat multiple times. The repeat
units are found at different locations, or loci, throughout the genome. Every STR has
multiple alleles. These allele variants are defined by the number of repeat units
present or by the length of the repeat sequence. STRs are surrounded by
nonvariable segments of DNA known as flanking regions. The STR allele in
Fig. 1.7 could be denoted by “6”, as the repeat unit (GATA) repeats 6 times, or as
70 base pairs (bps), because it is 70 bases long including the starting/ending
flanking regions. Different alleles of the same STR may correspond to
different numbers of GATA repeats, with the same flanking regions.
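
The two ways of naming an STR allele (by repeat count or by total length) can be illustrated with a short R sketch. The flanking sequences below are made up purely for illustration and are not the sequence shown in Fig. 1.7, so the total length here differs from the 70 bases quoted above.

```r
# Hypothetical flanking regions (illustrative only, not real Ebola data)
left_flank  <- "ACGTTGCA"
right_flank <- "TTGACCGT"
repeat_unit <- "GATA"

# Build an allele with 6 tandem copies of the repeat unit
allele <- paste0(left_flank, strrep(repeat_unit, 6), right_flank)

# Find every run of back-to-back repeat units and keep the longest one
runs <- regmatches(allele, gregexpr("(GATA)+", allele))[[1]]
repeat_count <- max(nchar(runs)) / nchar(repeat_unit)

# Allele length in bases, flanking regions included
allele_length <- nchar(allele)

repeat_count    # designation by number of repeats
allele_length   # designation by length in bases
```

A second allele of the same STR would keep the same flanks and change only the number passed to strrep().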
A: Individual brain parcellation B: LPBA40 atlas
Fig. 1.6 Indices of the 56 regions of interest (ROIs): A and B - extracted by the BrainParser
software using the LPBA40 brain atlas


1.2.8  Next  Generation  Sequence  (NGS)  Analysis 

Whole-genome and exome sequencing provide essential clues for identifying genes
responsible for simple Mendelian inherited disorders. A recent paper proposed
methods that can be applied to complex disorders based on population genetics.
Fig. 1.7 Snippet of the Ebola STR genomic sequence


Next generation sequencing (NGS) technologies include bioinformatics resources to
analyze the dense and complex sequence data. The Graphical Pipeline for Computational
Genomics (GPCG) performs the computational steps required to analyze
NGS data. The GPCG implements flexible workflows for basic sequence alignment,
sequence data quality control, single nucleotide polymorphism analysis, copy number
variant identification, annotation, and visualization of results. Applications of
NGS analysis provide clinical utility for identifying miRNA signatures in diseases.
Enabling hypothesis testing about the functional role of variants in the human
genome will help to pinpoint the genetic risk factors of many diseases (e.g.,
neuropsychiatric disorders).


1.2.9  Neuroimaging-Genetics 

A computational infrastructure for high-throughput neuroimaging-genetics
(doi: https://doi.org/10.3389/fninf.2014.00041) facilitates the data aggregation,
harmonization, processing, and interpretation of multisource imaging, genomic,
clinical, and cognitive data. A unique feature of this architecture is the graphical user
interface to the Pipeline environment. Through its client-server architecture, the
Pipeline environment provides a graphical user interface for designing, executing,
monitoring, validating, and disseminating complex protocols that utilize diverse
suites of software tools and web services. These pipeline workflows are represented
as portable Extensible Markup Language (XML) objects, which transfer the execution
instructions and user specifications from the client user machine to remote
pipeline servers for distributed computing. Using Alzheimer’s and Parkinson’s
data, this study provides examples of translational applications using this
infrastructure (Figs. 1.8 and 1.9).
Fig. 1.9 A schematic of a distributed high-throughput computational environment for managing,
processing, and visualization of large, complex, and heterogeneous biomedical data


1.3 Common Characteristics of Big (Biomedical and Health) Data


Software developments, student training, utilization of Cloud or IoT (Internet of
Things) service platforms, and methodological advances associated with Big Data
Discovery Science all present existing opportunities for learners, educators,
researchers, practitioners, and policy makers alike. A review of many biomedical,
health informatics, and clinical studies suggests that there are indeed common
characteristics of complex big data challenges. For instance, imagine analyzing the
observational data of thousands of Parkinson’s disease patients, based on tens of
thousands of signature biomarkers derived from multisource imaging, genetics, and
clinical, physiologic, phenomics, and demographic data elements. IBM had defined
the qualitative characteristics of Big Data as 4 Vs: Volume, Variety, Velocity, and
Veracity (there are additional V-qualifiers that can be added).

More recently (PMID:26998309) we defined a constructive characterization of
Big Data that clearly identifies the methodological gaps and necessary tools to
handle such archives (Table 1.1).

Table 1.1 The characteristic six dimensions of Big biomedical and healthcare data

BD dimensions   Necessary techniques, tools, services, and support infrastructure
Size            Harvesting and management of vast amounts of data
Complexity      Wranglers for dealing with heterogeneous data
Incongruency    Tools for data harmonization and aggregation
Multisource     Transfer and joint modeling of disparate elements
Multiscale      Macro to meso- to microscale observations
Incomplete      Reliable management of missing data


1.4  Data  Science 

Data  science  is  an  emerging  new  field  that  (1)  is  extremely  transdisciplinary  - 
bridging  between  the  theoretical,  computational,  experimental,  and  biosocial  areas; 
(2)  deals  with  enormous  amounts  of  complex,  incongruent,  and  dynamic  data  from 
multiple  sources;  and  (3)  aims  to  develop  algorithms,  methods,  tools,  and  services 
capable  of  ingesting  such  datasets  and  generating  semiautomated  decision  support 
systems.  The  latter  can  mine  the  data  for  patterns  or  motifs,  predict  expected 
outcomes, suggest clustering or labeling of retrospective or prospective observations,
compute data signatures or fingerprints, extract valuable information, and offer
evidence-based actionable knowledge. Data science techniques often involve data
manipulation (wrangling), data harmonization and aggregation, exploratory or
confirmatory data analyses, predictive analytics, validation, and fine-tuning.


1.5  Predictive  Analytics 

Predictive analytics is the process of utilizing advanced mathematical formulations,
powerful statistical computing algorithms, and efficient software tools and services to
represent, interrogate, and interpret complex data. As its name suggests, a core aim
of predictive analytics is to forecast trends, predict patterns in the data, or
prognosticate the process behavior either within the range or outside the range of the
observed data (e.g., in the future, or at locations where data may not be available). In
this context, process refers to a natural phenomenon that is being investigated by
examining proxy data. Presumably, by collecting and exploring the intrinsic data
characteristics, we can track the behavior and unravel the underlying mechanism of
the system.
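
As a rough illustration of predicting within versus outside the observed range, the sketch below fits a straight line to simulated data and requests predictions at one interior point and one point well beyond the data. The simulated process, the seed, and the use of lm() are assumptions made for this example only; they are not a method prescribed by the text.

```r
# Simulated proxy data for a hypothetical linear process (illustrative only)
set.seed(1234)
x <- seq(1, 10, by = 0.5)
y <- 2 + 3 * x + rnorm(length(x), sd = 1)

# Fit a simple linear model to the observed data
fit <- lm(y ~ x)

# Predict inside the observed range (x = 5.25) and far outside it (x = 15)
new_x <- data.frame(x = c(5.25, 15))
pred <- predict(fit, newdata = new_x, interval = "prediction")
pred

# Width of each prediction interval: the extrapolated point is less certain
interval_width <- pred[, "upr"] - pred[, "lwr"]
interval_width
```

The wider interval at x = 15 reflects the extra uncertainty of extrapolating beyond the observed data, even when the model form is correct.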

The fundamental goal of predictive analytics is to identify relationships, associations,
arrangements, or motifs in the dataset, in terms of space, time, and features
(variables) that may prune the dimensionality of the data, i.e., reduce its complexity.
Using these process characteristics, predictive analytics may predict unknown outcomes,
produce estimations of likelihoods or parameters, generate classification
labels, or contribute other aggregate or individualized forecasts. We will discuss
how the outcomes of these predictive analytics may be refined, assessed, and
compared, e.g., between alternative methods. The underlying assumptions of the
specific predictive analytics technique determine its usability, affect the expected
accuracy, and guide the (human) actions resulting from the (machine) forecasts. In
this textbook, we will discuss supervised and unsupervised, model-based and model-free,
classification and regression, as well as deterministic, stochastic, classical, and
machine learning-based techniques for predictive analytics. The type of the expected
outcome (e.g., binary, polytomous, probability, scalar, vector, tensor, etc.) determines
if the predictive analytics strategy provides prediction, forecasting, labeling,
likelihoods, grouping, or motifs.


1.6  High-Throughput  Big  Data  Analytics 

The  pipeline  environment  provides  a  large  tool  chest  of  software  and  services  that 
can  be  integrated,  merged,  and  processed.  The  Pipeline  workflow  library  and  the 
workflow  miner  illustrate  much  of  the  functionality  that  is  available.  Java-based  and 
HTML5 webapp graphical user interfaces (GUIs) provide access to a powerful
4,000-core grid compute server (Fig. 1.10).


1.7  Examples  of  Data  Repositories,  Archives,  and  Services 

There are many sources of data available on the Internet. A number of them provide
open access to the data based on FAIR (Findable, Accessible, Interoperable, Reusable)
principles. Below are examples of open-access data sources that can be used to
test the techniques presented in this textbook. We demonstrate the tasks of retrieval,
manipulation, processing, analytics, and visualization using example datasets from
these archives.

• SOCR Wiki Data, http://wiki.socr.umich.edu/index.php/SOCR_Data

• SOCR Canvas datasets, https://umich.instructure.com/courses/38100/files/folder/data

• SOCR Case-Studies, http://wiki.socr.umich.edu/index.php/SOCR_Data

• XNAT, https://central.xnat.org

• IDA, http://ida.loni.usc.edu

• NIH dbGaP, https://dbgap.ncbi.nlm.nih.gov

• Data.gov, http://data.gov

Fig. 1.10 The pipeline environment provides a client-server platform for designing, executing,
tracking, sharing, and validating complex data analytic protocols


1.8  DSPA  Expectations 

The  heterogeneity  of  data  science  makes  it  difficult  to  identify  a  precise  and 
complete  list  of  prerequisites  guaranteeing  deep  and  lasting  understanding  of  all 
the  presented  methods  and  techniques.  However,  the  reader  is  strongly  encouraged 
to glance over the preliminary prerequisites, the self-assessment pretest and remediation
materials, and the outcome competencies. Throughout this journey, it is
useful to remember the following points:

• You don't have to satisfy all prerequisites, be versed in all mathematical foundations,
have substantial statistical analysis expertise, or be an experienced
programmer.

• You don’t have to complete all chapters and sections in the order they appear in
the  DSPA  Topics  Flowchart.  Completing  one,  or  several,  of  the  suggested 
pathways  may  be  sufficient  for  many  readers. 

• The DSPA textbook aims to expand the trainees’ horizons, improve understanding,
enhance skills, and provide a set of advanced, validated, and practice-oriented
code, scripts, and protocols.

•  To  varying  degrees,  readers  will  develop  abilities  to  skillfully  utilize  the  tool 
chest  of  resources  provided  in  the  DSPA  textbook.  These  resources  can  be 
revised,  improved,  customized,  expanded,  and  applied  to  other  biomedicine  and 
biosocial  studies,  as  well  as  to  Big  Data  predictive  analytics  challenges  in  other 
disciplines. 

• The DSPA materials will challenge most readers. When the going gets tough,
seek help, engage with fellow trainees, search for help on the DSPA site and the
Internet, communicate via DSPA discussion forum/chat, and review references
and supplementary materials. Be proactive! Remember that you will gain, but it
will require commitment, prolonged immersion, hard work, and perseverance. If it
were easy, its value would be compromised.

•  When  covering  some  chapters,  some  readers  may  be  underwhelmed  or  bored. 
Feel  free  to  skim  over  chapters  or  sections  that  sound  familiar  and  move  forward 
to  the  next  topic.  Still,  it  is  worth  trying  the  corresponding  assignments  to  ensure 
that  you  have  a  firm  grasp  of  the  material,  and  that  your  technical  abilities  are 
sound. 

• Although the return on investment (e.g., time, effort) may vary between readers,
those who complete the DSPA textbook will discover something new, acquire
some  advanced  skills,  learn  novel  data  analytic  protocols,  and  may  conceive  of 
cutting-edge  ideas. 

• The complete R code (R and Rmd markdown) for all examples and demonstrations
presented in this textbook is available as electronic supplements.

• The author acknowledges that these materials may be improved. If you discover
typos, errors, inconsistencies, or other problems, please contact us
(DSPA.info@umich.edu) to correct, expand, or polish the resources, accordingly.
If  you  have  alternative  ideas,  suggestions  for  improvements,  optimized  code, 
interesting  data  and  case-studies,  or  any  other  refinements,  please  send  these 
along,  as  well.  All  suggestions  and  critiques  will  be  carefully  reviewed,  and 
potentially  incorporated  in  revisions  or  new  editions  with  appropriate  credits. 


Chapter  2 

Foundations  of  R 



This chapter introduces the foundations of R programming for visualization, statistical
computing and scientific inference. Specifically, in this chapter we will (1) discuss
the rationale for selecting R as a computational platform for all DSPA
demonstrations; (2) present the basics of installing shell-based R and the RStudio
user interface; (3) show some simple R commands and scripts (e.g., translating between
wide and long data formats, data simulation, data stratification and subsetting); (4) introduce
variable types and their manipulation; (5) demonstrate simple mathematical functions,
statistics, and matrix operators; (6) explore simple data visualization; and
(7) introduce optimization and model fitting. The chapter appendix includes references
to R introductory and advanced resources, as well as a primer on debugging.


2.1  Why  Use  R? 

There  are  many  different  classes  of  software  that  can  be  used  for  data  interrogation, 
modeling,  inference,  and  statistical  computing.  Among  these  are  R,  Python,  Java, 
C/C++, Perl, and many others. The table below (Fig. 2.1) compares R to several other
statistical analysis software packages; a more detailed comparison is available
online at https://en.wikipedia.org/wiki/Comparison_of_statistical_packages.

The  reader  may  also  review  the  following  two  comparisons  of  various  statistical 
computing  software  packages: 

•  UCLA  Stats  Software  Comparison 

•  Wikipedia  Stats  Software  Comparison 

Statistical Software: R
Advantages: R is actively maintained (> 100,000 developers, > 15K packages). Excellent
connectivity to various types of data and other systems. Versatile for solving problems in
many domains. It's free, open-source code. Anybody can access/review/extend the source code.
R is very stable and reliable. If you change or redistribute the R source code, you have to
make those changes available for anybody else to use. R runs anywhere (platform agnostic).
Extensibility: R supports extensions, e.g., for data manipulation, statistical modeling, and
graphics. Active and engaged community supports R. Unparalleled question-and-answer (Q&A)
websites. R connects with other languages (Java/C/JavaScript/Python/Fortran) & database
systems, and other programs, SAS, SPSS, etc. Other packages have add-ons to connect with R.
SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between
the two packages.
Disadvantages: Mostly scripting language. Steeper learning curve.

Statistical Software: SAS
Advantages: Large datasets. Commonly used in business & government.
Disadvantages: Somewhat dated programming language. Expensive/proprietary.

Statistical Software: Stata
Advantages: Easy statistical analyses.
Disadvantages: Mostly classical stats.

Statistical Software: SPSS
Advantages: Appropriate for beginners. Simple interfaces.
Disadvantages: Weak in more cutting-edge statistical procedures; lacking in robust methods
and survey methods.

Fig. 2.1 Comparison of several statistical software platforms (R, SAS, Stata, SPSS)

Let’s start by looking at an example R script that shows the estimates of the
citations of three statistical computing software packages over two decades
(1995-2015). More details about these command lines will be presented in later
chapters. However, it’s worth looking at the four specific steps, each indicated by a
line of code: (1) we start by loading two of the necessary R packages for data
transformation (reshape2) and visualization (ggplot2); (2) load the software
citation data from the Internet; (3) reformat the data; and (4) display the
composite graph of citations over time (Fig. 2.2).

Fig. 2.2 Estimated peer-reviewed publication citations for R, SAS and SPSS software
require(ggplot2)
require(reshape2)

Data_R_SAS_SPSS_Pubs <- read.csv('https://umich.instructure.com/files/2361245/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs)

# convert to long format (http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/)

df <- melt(df, id.vars = 'Year', variable.name = 'Software')
ggplot(data=df, aes(x=Year, y=value, color=Software, group = Software)) + geom_line() + geom_line(size=4) + labs(x='Year', y='Citations')


2.2  Getting  Started 

2.2.1  Install  Basic  Shell-Based  R 

R is free software that can be installed on any computer. The ‘R’ website is:
http://R-project.org. There you can download the shell-based R-environment following
this protocol:

• click download CRAN in the left bar

• choose a download site

• choose your operating system (e.g., Windows, Mac, Linux)

• click base

• choose the latest version to download R (3.4 or a newer version) for your
specific operating system, e.g., Windows.


2.2.2 GUI Based R Invocation (RStudio)

For  many  readers,  it’s  best  to  also  install  and  run  R  via  RStudio  GUI  (graphical  user 
interface).  To  install  RStudio,  go  to:  http://www.rstudio.org/  and  do  the  following: 

•  click  Download  RStudio 

•  click  Download  RStudio  Desktop 

•  click  Recommended  For  Your  System 

•  download  the  .exe  file  and  run  it  (choose  default  answers  for  all  questions) 


2.2.3  RStudio  GUI  Layout 

The  RStudio  interface  consists  of  several  windows. 

• Bottom left: console window (also called command window). Here you can type
simple commands after the “>” prompt and R will then execute your command.
This is the most important window, because this is where R actually executes commands.

• Top left: editor window (also called script window). Collections of commands
(scripts) can be edited and saved. If you don’t see this window, you can open
it with File > New > R script. Just typing a command in the editor window is not
enough; it has to get into the command window before R executes the command.
If you want to run a line from the script window (or the whole script), you can
click Run or press CTRL + ENTER to send it to the command window.

• Top right: workspace / history window. In the workspace window, you can see
which data and values R has in its memory. You can view and edit the values by
clicking on them. The history window shows what has been typed before.

• Bottom right: files / plots / packages / help window. Here you can open files, view
plots (also previous plots), install and load packages or use the help function.

You can change the size of the windows by dragging the grey bars between the
windows.


2.2.4  Some  Notes 

•  The  basic  R  environment  installation  comes  with  limited  core  functionality. 
Everyone  eventually  will  have  to  install  more  packages,  e.g.,  reshape2, 
ggplot2,  and  we  will  show  how  to  expand  your  RStudio  library  throughout 
these  materials. 

• The core R environment also has to be upgraded occasionally, e.g., every
3-6 months to get R patches, to fix known problems, and to add new functionality.
This is also easy to do.

• The assignment operator in R is <- (although = may also be used), so to assign a
value of 2 to a variable x, we can write x <- 2 or equivalently x = 2.
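
The notes above can be tried directly in the console. The following is a minimal sketch: the package names come from the text, while the "install only what is missing" pattern is just one reasonable convention, and the install call is left commented out because it needs an Internet connection.

```r
# Both assignment forms store the same value
x <- 2
y = 2
identical(x, y)

# Check which of the packages used in this chapter are not yet installed
pkgs <- c("reshape2", "ggplot2")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Install only the missing ones (uncomment to run; requires network access)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

In RStudio the same installation can be done from the Packages tab of the bottom-right window.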


2.3  Help 

R provides documentation for its functions. The function call to get this
documentation is help(). Just type help(topic) in the R console to get a
detailed explanation of an R topic or function. Another way of doing it is to
call ?topic, which is even easier, or ??topic to search the documentation more broadly.

For example, if we want to check the function for linear models (i.e., function
lm()), we can use either of the following commands.


help(lm)
?lm
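
Beyond help() and ?, a few other base R helpers are handy when you only half-remember a function. The sketch below is illustrative; the search pattern and the keyword are arbitrary choices.

```r
# Show the argument list of lm() without opening the full help page
args(lm)

# List attached objects whose names start with "read."
readers <- apropos("^read\\.")
readers

# Keyword search across installed documentation; equivalent to ??"linear model".
# Left commented out because the search can be slow on some installations.
# help.search("linear model")
```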


2.4  Simple  Wide-to-Long  Data  format  Translation 

Let’s  start  by  experimenting  with  an  R  script  for  transforming  (melting)  a  simple 
dataset. 


rawdata_wide <- read.table(header=TRUE, text='
 CaseID Gender Age Condition1 Condition2
      1      M   5       13       10.5
      2      F   6       16       11.2
      3      F   8       10       18.3
      4      M   9      9.5       18.1
      5      M  10     12.1       19
')

# Make the CaseID column a factor
rawdata_wide$subject <- factor(rawdata_wide$CaseID)
rawdata_wide


##   CaseID Gender Age Condition1 Condition2 subject
## 1      1      M   5       13.0       10.5       1
## 2      2      F   6       16.0       11.2       2
## 3      3      F   8       10.0       18.3       3
## 4      4      M   9        9.5       18.1       4
## 5      5      M  10       12.1       19.0       5

library(reshape2)

# Specify id.vars: the variables to keep (don't split apart on!)
melt(rawdata_wide, id.vars=c("CaseID", "Gender"))


##    CaseID Gender   variable value
## 1       1      M        Age     5
## 2       2      F        Age     6
## 3       3      F        Age     8
## 4       4      M        Age     9
## 5       5      M        Age    10
## 6       1      M Condition1    13
## 7       2      F Condition1    16
## 8       3      F Condition1    10
## 9       4      M Condition1   9.5
## 10      5      M Condition1  12.1
## 11      1      M Condition2  10.5
## 12      2      F Condition2  11.2
## 13      3      F Condition2  18.3
## 14      4      M Condition2  18.1
## 15      5      M Condition2    19
## 16      1      M    subject     1
## 17      2      F    subject     2
## 18      3      F    subject     3
## 19      4      M    subject     4
## 20      5      M    subject     5
There are options for melt that can make the output a little easier to work with:


data_long <- melt(rawdata_wide,
    # ID variables - all the variables to keep but not split apart on
    id.vars=c("CaseID", "Gender"),
    # The source columns
    measure.vars=c("Age", "Condition1", "Condition2"),
    # Name of the destination column that will identify the original
    # column that the measurement came from
    variable.name="Feature",
    value.name="Measurement"
)


data_long

##    CaseID Gender    Feature Measurement
## 1       1      M        Age         5.0
## 2       2      F        Age         6.0
## 3       3      F        Age         8.0
## 4       4      M        Age         9.0
## 5       5      M        Age        10.0
## 6       1      M Condition1        13.0
## 7       2      F Condition1        16.0
## 8       3      F Condition1        10.0
## 9       4      M Condition1         9.5
## 10      5      M Condition1        12.1
## 11      1      M Condition2        10.5
## 12      2      F Condition2        11.2
## 13      3      F Condition2        18.3
## 14      4      M Condition2        18.1
## 15      5      M Condition2        19.0

For an elaborate justification, detailed description, and multiple examples of
handling long-and-wide data, messy and tidy data, and data cleaning strategies see
the JSS Tidy Data article by Hadley Wickham
(https://www.jstatsoft.org/article/view/v059i10).
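
The reverse translation, from long back to wide, follows the same idea. The sketch below uses base R's reshape() instead of the reshape2 package so that it runs without extra installations; this substitution, the object names, and the rebuilt toy data are choices made for this example. The generated column names (e.g., Measurement.Age) are reshape()'s default naming.

```r
# Same toy data as above, rebuilt here so the sketch is self-contained
wide <- data.frame(
  CaseID = 1:5,
  Gender = c("M", "F", "F", "M", "M"),
  Age = c(5, 6, 8, 9, 10),
  Condition1 = c(13, 16, 10, 9.5, 12.1),
  Condition2 = c(10.5, 11.2, 18.3, 18.1, 19)
)
measures <- c("Age", "Condition1", "Condition2")

# Wide -> long with base R (plays the role of melt() in the text)
long <- reshape(wide, direction = "long",
                varying = measures, v.names = "Measurement",
                timevar = "Feature", times = measures,
                idvar = c("CaseID", "Gender"))

# Long -> wide: one column per distinct value of Feature
wide_again <- reshape(long, direction = "wide",
                      idvar = c("CaseID", "Gender"),
                      timevar = "Feature", v.names = "Measurement")

dim(long)          # rows and columns of the long table
names(wide_again)  # column names of the recovered wide table
```

With reshape2 loaded, a comparable long-to-wide step is available through its dcast() function.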


2.5  Data  Generation 

Popular data generation functions are c(), seq(), rep(), and data.frame().
Sometimes, we may also use list() and array() to generate data.

c()

c() creates a (column) vector. With option recursive = T, it descends through
lists combining all elements into one vector.

a <- c(1, 2, 3, 5, 6, 7, 10, 1, 4)
a

## [1]  1  2  3  5  6  7 10  1  4

c(list(A = c(Z = 1, Y = 2), B = c(X = 7), C = c(W = 7, V=3, U=-1.9)), recursive = TRUE)

##  A.Z  A.Y  B.X  C.W  C.V  C.U
##  1.0  2.0  7.0  7.0  3.0 -1.9


When combined with list(), c() successfully created a vector with all the
information in a list with three members A, B, and C.
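
A few additional properties of c() are worth seeing once. The sketch below is illustrative only and the variable names are arbitrary: it shows that a plain vector carries no dimension attribute, that mixed inputs are coerced to a common type, and that nested calls are flattened.

```r
# A plain vector has no dimensions (it is neither a row nor a column matrix)
a <- c(1, 2, 3)
dim(a)

# Mixed types are coerced to the most general type present (here: character)
mixed <- c(1, "a", TRUE)
mixed

# Logical and integer values are promoted to double when combined with a double
nums <- c(1.5, TRUE, 2L)
nums

# Elements can be named and then accessed by name
named <- c(first = 1, second = 2)
named["second"]

# Nested calls are flattened into a single vector
flat <- c(c(1, 2), c(3, 4))
length(flat)
```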

seq(from,  to) 

seq(from, to) generates a sequence. Adding the option by= specifies the
increment; the option length= specifies the desired length. Also, seq(along = x)
generates the sequence 1, 2, ..., length(x). This is used in for loops to create an ID
for each element in x.

seq(1, 20, by=0.5)

##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5
## [15]  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [29] 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0

seq(1, 20, length=9)

## [1]  1.000  3.375  5.750  8.125 10.500 12.875 15.250 17.625 20.000

seq(along=c(5, 4, 5, 6))

## [1] 1 2 3 4
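
The loop use mentioned above might look like the following sketch. It uses seq_along(), the base R shorthand for seq(along = x); the example vectors are arbitrary.

```r
x <- c(5, 4, 5, 6)

# Visit each element together with its position (ID)
for (i in seq_along(x)) {
  cat("element", i, "has value", x[i], "\n")
}

# seq_along() is safe for empty vectors: the loop body simply never runs
empty <- numeric(0)
safe_ids <- seq_along(empty)
length(safe_ids)

# By contrast, 1:length(empty) counts down from 1 to 0, a common pitfall
risky_ids <- 1:length(empty)
risky_ids
```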

rep(x,  times) 

rep(x, times) creates a sequence that repeats x a specified number of times. The
option each= allows us to repeat each element of x a certain number of
times before moving on to the next element.

rep(c(1, 2, 3), 4)

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(c(1, 2, 3), each=4)

##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

Compare this to replicating using replicate().

X <- seq(along=c(1, 2, 3)); replicate(4, X+1)

##      [,1] [,2] [,3] [,4]
## [1,]    2    2    2    2
## [2,]    3    3    3    3
## [3,]    4    4    4    4
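
The difference between rep() and replicate() becomes clearer when the repeated expression is random: rep() copies a value that has already been computed, whereas replicate() re-evaluates the expression each time. A small sketch, with an arbitrary seed:

```r
set.seed(2024)

# rep() evaluates runif(1) once and copies the result three times
copied <- rep(runif(1), 3)
copied

# replicate() evaluates runif(1) three separate times
redrawn <- replicate(3, runif(1))
redrawn

# rep() also accepts a vector of counts, one per element of x
counts <- rep(c("a", "b"), times = c(2, 3))
counts
```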

data.frame() 

data.frame() creates a data frame of named or unnamed arguments. We can
combine multiple vectors. Each vector is stored as a column. Shorter vectors are
recycled to the length of the longest one, provided that the longest length is a whole
multiple of the shorter lengths. With data.frame() you can mix
numeric and character vectors.

data.frame(v=1:4, ch=c("a", "B", "C", "d"), n=c(10, 11))

##   v ch  n
## 1 1  a 10
## 2 2  B 11
## 3 3  C 10
## 4 4  d 11

Note that 1:4 means from 1 to 4. The operator : generates a sequence.
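
The recycling rule can be checked with a short sketch: a length-2 column fits into 4 rows, but a length-3 column does not and raises an error. The tryCatch() wrapper is used here only so that the example keeps running after the error.

```r
# Length-2 vector is recycled twice to fill 4 rows
df <- data.frame(v = 1:4, ch = c("a", "B", "C", "d"), n = c(10, 11))
str(df)

# Length-3 vector cannot be recycled evenly into 4 rows, so R signals an error
bad <- tryCatch(
  data.frame(v = 1:4, n = c(10, 11, 12)),
  error = function(e) conditionMessage(e)
)
bad
```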

list() 

Like  we  mentioned  in  function  c  ( ) ,  1  i  s  t  ( )  creates  a  list  of  the  named  or  unnamed 
arguments  -  indexing  rule:  from  1  to  n,  including  1  and  n. 

L<-List(a=c(lj  2)j  b="hi"j  c=-3+3i) 

L 

##  $a 

##  [1]  1  2 
## 

##  $b 

##  [1]  "hi" 

## 

##  $c 

##  [1]  -3+3i 

#  Note  Complex  Numbers  a  <-  -l+3i;  b  <-  -2-2i;  a+b 

We use $ to access a named member of the list and [[ ]] to access the element
corresponding to a specific index. For example,

L$a[[2]]

## [1] 2

L$b

## [1] "hi"

Note that R uses 1-based numbering rather than 0-based like some other languages
(C/Java), so the first element of a list has index 1.

array(x,  dim=) 

array(x, dim=) creates an array with specific dimensions. For example,
dim = c(3, 4, 2) means two 3x4 matrices. We use [] to extract specific
elements in the array. [2, 3, 1] means the element at the second row, third column
in the first page. Leaving one index empty selects a whole row, column, or page.
[2, , 1] means the second row in the first page. See
this image (Fig. 2.3):


Fig. 2.3 Indexing cell values in multidimensional arrays (tensors). Cells are labeled by index triples such as (2,3,1); the horizontal axis is labeled Column (width).


ar <- array(1:24, dim=c(3, 4, 2)); ar

## , , 1
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
##
## , , 2
##
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

ar[2, 3, 1]

## [1] 8

ar[2, , 1]

## [1]  2  5  8 11

In  general,  multi-dimensional  arrays  are  called  “tensors”  (of  order  =  number  of 
dimensions). 
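As a small sketch of how array indices relate to storage, the example below rebuilds the same 3x4x2 array and compares a three-index lookup with the equivalent single (linear) index. The helper names i, j, k, and lin are introduced here for illustration only; the arithmetic relies on R storing arrays in column-major order.

```r
ar <- array(1:24, dim = c(3, 4, 2))

page2 <- ar[, , 2]            # leaving two indices empty returns a whole page

i <- 2; j <- 3; k <- 1        # row, column, page
# Column-major position: rows vary fastest, then columns, then pages.
lin <- i + (j - 1) * 3 + (k - 1) * 3 * 4

print(dim(page2))
print(c(ar[i, j, k], ar[lin]))   # both lookups return the same cell
```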

Other  useful  functions  are: 

• matrix(x, nrow=, ncol=): creates a matrix from the elements of x with nrow rows
and ncol columns.

• factor(x, levels=): encodes a vector x as a factor.

• gl(n, k, length=n*k, labels=1:n): generates factors by specifying the pattern of
their levels; n is the number of levels, and k is the number of replications.

• expand.grid(): creates a data frame from all combinations of the supplied vectors
or factors.

• rbind(): combines arguments by rows for matrices, data frames, and others.

• cbind(): combines arguments by columns for matrices, data frames, and others.
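The helper functions listed above can be tried on small inputs. This is a minimal sketch; the labels ("ctrl", "trt") and the column names (dose, sex) are hypothetical and chosen only for illustration.

```r
f <- gl(2, 3, labels = c("ctrl", "trt"))   # 2 levels, each replicated 3 times
grid <- expand.grid(dose = c(1, 2), sex = c("F", "M"))   # all 2 x 2 combinations
by_row <- rbind(1:3, 4:6)   # each argument becomes a row    (2 x 3 matrix)
by_col <- cbind(1:3, 4:6)   # each argument becomes a column (3 x 2 matrix)

print(f)
print(grid)
print(dim(by_row))
print(dim(by_col))
```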

2.6 Input/Output (I/O)


The first pair of functions we will talk about are save(), which writes R objects to a
file, and load(), which helps us reload datasets written with the save() function.

Let's create some data first.

x <- seq(1, 10, by=0.5)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file="xy.RData")

load("xy.RData")
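To see that save() and load() really round-trip the objects, one can remove them between the two calls. The sketch below is an illustration under one assumption: it writes to a temporary file (via tempfile()) rather than to xy.RData, so nothing is left behind in the working directory.

```r
x <- seq(1, 10, by = 0.5)
y <- list(a = 1, b = TRUE, c = "oops")

f <- tempfile(fileext = ".RData")   # hypothetical temporary path
save(x, y, file = f)

x_saved <- x; y_saved <- y          # keep copies for comparison
rm(x, y)                            # the originals are gone now

restored <- load(f)                 # load() returns the names it restored
print(restored)
unlink(f)                           # clean up the temporary file
```

After load(), x and y exist again under their original names; the return value is only the character vector of those names.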

data(x) loads the specified data sets and library(x) loads the necessary
add-on packages.


data("iris")
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50


read.table(file) reads a file in table format and creates a data frame from it. The
default separator sep = "" is any whitespace. Use header = TRUE to read the first
line as a header of column names. Use as.is = TRUE to prevent character vectors
from being converted to factors. Use comment.char = "" to prevent "#" from
being interpreted as a comment. Use skip = n to skip n lines before reading data.
See the help for options on row naming, NA treatment, and others.

Let's use read.table() to read a text file from our class files.


data.txt <- read.table("https://umich.instructure.com/files/1628628/download?download_frd=1", header=T, as.is=T) # 01a_data.txt

summary(data.txt)

##      Name               Team             Position             Height
##  Length:1034        Length:1034        Length:1034        Min.   :67.0
##  Class :character   Class :character   Class :character   1st Qu.:72.0
##  Mode  :character   Mode  :character   Mode  :character   Median :74.0
##                                                           Mean   :73.7
##                                                           3rd Qu.:75.0
##                                                           Max.   :83.0
##      Weight           Age
##  Min.   :150.0   Min.   :20.90
##  1st Qu.:187.0   1st Qu.:25.44
##  Median :200.0   Median :27.93
##  Mean   :201.7   Mean   :28.74
##  3rd Qu.:215.0   3rd Qu.:31.23



read.csv("filename", header = TRUE) is identical to read.table() but
with defaults set for reading comma-delimited files.


data.csv <- read.csv("https://umich.instructure.com/files/1628650/download?download_frd=1", header = T) # 01_hdp.csv
summary(data.csv)

##    tumorsize           co2             pain           wound
##  Min.   : 33.97   Min.   :1.222   Min.   :1.000   Min.   :1.000
##  1st Qu.: 62.49   1st Qu.:1.519   1st Qu.:4.000   1st Qu.:5.000
##  Median : 70.07   Median :1.601   Median :5.000   Median :6.000
##  Mean   : 70.88   Mean   :1.605   Mean   :5.473   Mean   :5.732
##  3rd Qu.: 79.02   3rd Qu.:1.687   3rd Qu.:6.000   3rd Qu.:7.000
##  Max.   :116.46   Max.   :2.128   Max.   :9.000   Max.   :9.000
##     mobility       ntumors        nmorphine        remission
##  Min.   :1.00   Min.   :0.000   Min.   : 0.000   Min.   :0.0000
##  1st Qu.:5.00   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:0.0000
##  Median :6.00   Median :3.000   Median : 3.000   Median :0.0000
##  Mean   :6.08   Mean   :3.066   Mean   : 3.624   Mean   :0.2957
##  3rd Qu.:7.00   3rd Qu.:5.000   3rd Qu.: 5.000   3rd Qu.:1.0000
##  Max.   :9.00   Max.   :9.000   Max.   :18.000   Max.   :1.0000
##   lungcapacity          Age           Married    FamilyHx     SmokingHx
##  Min.   :0.01612   Min.   :26.32   Min.   :0.0   no :6820   current:1705
##  1st Qu.:0.67647   1st Qu.:46.69   1st Qu.:0.0   yes:1705   former :1705
##  Median :0.81560   Median :50.93   Median :1.0              never  :5115
##  Mean   :0.77409   Mean   :50.97   Mean   :0.6
##  3rd Qu.:0.91150   3rd Qu.:55.27   3rd Qu.:1.0
##  Max.   :0.99980   Max.   :74.48   Max.   :1.0
##      Sex       CancerStage  LengthofStay         WBC            RBC
##  female:5115   I  :2558    Min.   : 1.000   Min.   :2131   Min.   :3.919
##  male  :3410   II :3409    1st Qu.: 5.000   1st Qu.:5323   1st Qu.:4.802
##                III:1705    Median : 5.000   Median :6007   Median :4.994
##                IV : 853    Mean   : 5.492   Mean   :5998   Mean   :4.995
##                            3rd Qu.: 6.000   3rd Qu.:6663   3rd Qu.:5.190
##                            Max.   :10.000   Max.   :9776   Max.   :6.065
##       BMI             IL6                CRP               DID
##  Min.   :18.38   Min.   : 0.03521   Min.   : 0.0451   Min.   :  1.0
##  1st Qu.:24.20   1st Qu.: 1.93039   1st Qu.: 2.6968   1st Qu.:100.0
##  Median :27.73   Median : 3.34400   Median : 4.3330   Median :199.0
##  Mean   :29.07   Mean   : 4.01698   Mean   : 4.9730   Mean   :203.3
##  3rd Qu.:32.54   3rd Qu.: 5.40551   3rd Qu.: 6.5952   3rd Qu.:309.0
##  Max.   :58.00   Max.   :23.72777   Max.   :28.7421   Max.   :407.0
##    Experience        School        Lawsuits          HID
##  Min.   : 7.00   average:6405   Min.   :0.000   Min.   : 1.00
##  1st Qu.:15.00   top    :2120   1st Qu.:1.000   1st Qu.: 9.00
##  Median :18.00                  Median :2.000   Median :17.00
##  Mean   :17.64                  Mean   :1.866   Mean   :17.76
##  3rd Qu.:21.00                  3rd Qu.:3.000   3rd Qu.:27.00
##  Max.   :29.00                  Max.   :9.000   Max.   :35.00
##     Medicaid
##  Min.   :0.1416
##  1st Qu.:0.3369
##  Median :0.5215
##  Mean   :0.5125
##  3rd Qu.:0.7083
##  Max.   :0.8187

read.delim("filename", header = TRUE) is very similar to the first two.
However, it has defaults set for reading tab-delimited files.

Also, we have read.fwf(file, widths, header = FALSE, sep = "\t",
as.is = FALSE) to read a table of fixed width formatted data into a data frame.

match(x, y) returns a vector of the positions of the (first) matches of its first
argument in its second. If an element of x has no match in y, the output
for that element is NA.

match(c(1, 2, 4, 5), c(1, 4, 4, 5, 6, 7))

## [1]  1 NA  2  4


save.image(file) saves all objects in the current workspace.

write.table(x, file = "", row.names = TRUE, col.names = TRUE, sep = " ")
prints x after converting it to a data frame and stores it in the specified file. If quote is
TRUE, character or factor columns are surrounded by quotes ("). sep is the field
separator, eol is the end-of-line separator, and na is the string for missing values. Use
col.names=NA to add a blank column header to get the column headers aligned
correctly for spreadsheet input.
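A quick way to check the write.table() and read.table() options described above is a round trip through a temporary file. This is a sketch with a made-up data frame (the column names id, name, and score are hypothetical); stringsAsFactors = FALSE is set explicitly so the behavior does not depend on the R version.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(2.5, NA, 4),
                 stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".txt")   # hypothetical temporary path
# Tab-separated, no row names, no quotes around character columns.
write.table(df, file = tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# header = TRUE restores the column names; as.is = TRUE keeps strings as character.
back <- read.table(tmp, header = TRUE, sep = "\t", as.is = TRUE)
unlink(tmp)

print(back)
print(all.equal(df, back))
```

The missing value is written as the string NA (the default for the na argument) and is read back as a missing value.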

Most of the I/O functions have a file argument. This can often be a character string
naming a file or a connection. file = "" means the standard input or output.
Connections can include files, pipes, zipped files, and R variables. On Windows, the
file connection can also be used with description = "clipboard". To read a
table copied from Excel, use x <- read.delim("clipboard").

To write a table to the clipboard for Excel, use write.table(x, "clipboard",
sep = "\t", col.names = NA). For database interaction, see
packages RODBC, DBI, RMySQL, RPgSQL, and ROracle, as well as packages
XML, hdf5, netCDF for reading other file formats. We will talk about some of them
in later chapters.

Note that an alternative library called rio handles import/export of multiple data
types using a simple syntax.


2.7  Slicing  and  Extracting  Data 

Table  2.1  shows  us  how  to  index  vectors. 

Indexing lists is similar to indexing vectors, but some of the symbols are
different (Table 2.2).

Indexing  for  matrices  is  a  higher  dimensional  version  of  indexing  vectors 
(Table  2.3). 


Table 2.1 Vector indexing in R

Expression                          Explanation
x[n]                                Nth element
x[-n]                               All but the nth element
x[1:n]                              First n elements
x[-(1:n)]                           Elements from n + 1 to the end
x[c(1, 4, 2)]                       Specific elements
x["name"]                           Element named "name"
x[x > 3]                            All elements greater than 3
x[x > 3 & x < 5]                    All elements between 3 and 5
x[x %in% c("a", "and", "the")]      Elements in the given set

Table 2.2 List indexing in R

Expression                          Explanation
x[n]                                List containing the nth element
x[[n]]                              Nth element of the list
x[["name"]]                         Element of the list named "name"

Table 2.3 Matrix indexing in R

Expression                          Explanation
x[i, j]                             Element at row i, column j
x[i, ]                              Row i
x[, j]                              Column j
x[, c(1, 3)]                        Columns 1 and 3
x["name", ]                         Row named "name"
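The indexing rules in Tables 2.1, 2.2, and 2.3 can be checked on small objects. The following sketch uses made-up data (the names x, lst, and m and their contents are illustrative only).

```r
# Vector indexing (Table 2.1)
x <- c(a = 10, b = 20, c = 30, d = 40)
second <- x[2]                 # element b
all_but_second <- x[-2]        # a, c, d
between <- x[x > 15 & x < 35]  # b and c

# List indexing (Table 2.2): [ ] keeps the list wrapper, [[ ]] extracts the element
lst <- list(num = 1:3, txt = "hi")
sub_list <- lst[1]             # a list of length 1
element <- lst[[1]]            # the integer vector 1:3

# Matrix indexing (Table 2.3)
m <- matrix(1:6, nrow = 2)     # rows: (1, 3, 5) and (2, 4, 6)
cell <- m[2, 3]                # 6
cols <- m[, c(1, 3)]           # columns 1 and 3

print(between)
print(class(sub_list)); print(class(element))
print(cols)
```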

2.8  Variable  Conversion 

The  following  functions  can  be  used  to  convert  data  types: 

as.array(x), as.data.frame(x), as.numeric(x), as.logical(x),
as.complex(x), as.character(x), ...

Typing methods(as) in the console will generate a complete list of variable
conversion functions.
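A few conversions in action, as a minimal sketch with made-up inputs. Note that a failed conversion yields NA (with a warning) rather than an error; suppressWarnings() is used here only to keep the output quiet.

```r
num <- as.numeric("3.14")                       # character -> numeric
int <- as.integer(3.9)                          # truncates toward zero: 3
chr <- as.character(42)                         # numeric -> character: "42"
lgl <- as.logical(c("TRUE", "false", "yes"))    # unrecognized strings become NA
bad <- suppressWarnings(as.numeric("abc"))      # not a number -> NA

print(num); print(int); print(chr); print(lgl); print(bad)
```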


2.9  Variable  Information 

The following functions test whether an object, or each of its elements, is of a specific type:

is.na(x), is.null(x), is.array(x), is.data.frame(x),
is.numeric(x), is.complex(x), is.character(x), ...

For a complete list, type methods(is) in the R console. These functions return
TRUE or FALSE logical values. Element-wise tests such as is.na(x) return one value
for each element of x, whereas type tests such as is.numeric(x) return a single
value for the whole object.

length(x) gives us the number of elements in x.

x <- c(1, 3, 10, 23, 1, 3)
length(x)

## [1] 6

dim(x) retrieves or sets the dimension of an object.

x <- 1:12
dim(x) <- c(3, 4)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

dimnames(x) retrieves or sets the dimension names of an object. For higher
dimensional objects like matrices or arrays we can combine dimnames() with list().

dimnames(x) <- list(c("R1", "R2", "R3"), c("C1", "C2", "C3", "C4")); x

##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12

nrow(x) gives the number of rows; ncol(x) gives the number of columns.

nrow(x)

## [1] 3

ncol(x)

## [1] 4

class(x) gets or sets the class of x. Note that we can use unclass(x) to remove
the class attribute of x.

class(x)

## [1] "matrix"

class(x) <- "myclass"
x <- unclass(x)
x

##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12


attr(x, which) gets or sets the attribute which of x.

attr(x, "class")

## NULL

attr(x, "dim") <- c(2, 6)
x

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12

From the above commands we know that when we unclass x, its class attribute
would be NULL.

attributes(obj) gets or sets the list of attributes of an object.

attributes(x) <- list(mycomment = "really special", dim = 3:4,
    dimnames = list(LETTERS[1:3], letters[1:4]), names = paste(1:12))
x

##   a b c  d
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
## attr(,"mycomment")
## [1] "really special"
## attr(,"names")
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

2.10  Data  Selection  and  Manipulation 

In  this  section,  we  will  introduce  some  data  manipulation  functions.  In  addition, 
tools  from  dplyr  provide  easy  dataset  manipulation  routines. 

which.max(x)  returns  the  index  of  the  greatest  element  of  x.  which.min(x) 
returns  the  index  of  the  smallest  element  of  x.  rev(x)  reverses  the  elements  of 
x.  Let’s  see  these  three  functions  first. 

x <- c(1, 5, 2, 1, 10, 40, 3)
which.max(x)

## [1] 6

which.min(x)

## [1] 1

rev(x)

## [1]  3 40 10  1  2  5  1

sort(x) sorts the elements of x in increasing order. To sort in decreasing order we
can use rev(sort(x)).

sort(x)

## [1]  1  1  2  3  5 10 40

rev(sort(x))

## [1] 40 10  5  3  2  1  1

cut(x, breaks) divides the range of x into intervals and codes the values in x
according to the interval they fall into, returning a factor. breaks is either the
number of equal-length intervals or a vector of cut points.

x

## [1]  1  5  2  1 10 40  3

cut(x, 3)

## [1] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (27,40]
## [7] (0.961,14]
## Levels: (0.961,14] (14,27] (27,40]

cut(x, c(0, 5, 20, 30))

## [1] (0,5]  (0,5]  (0,5]  (0,5]  (5,20] <NA>   (0,5]
## Levels: (0,5] (5,20] (20,30]
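In the last call, 40 falls outside every interval and is coded as NA. One way to avoid this is to extend the cut points so they cover the whole range; the labels argument of cut() then gives the intervals readable names. The sketch below assumes the hypothetical labels "low", "mid", and "high".

```r
x <- c(1, 5, 2, 1, 10, 40, 3)

# Cut points now reach 40, so every value lands in some interval.
grp <- cut(x, breaks = c(0, 5, 20, 40), labels = c("low", "mid", "high"))
counts <- table(grp)   # frequency of each interval

print(grp)
print(counts)
```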

which(x == a) returns a vector of the indices of x for which the comparison operation
is true (TRUE). For example, it returns the value i if x[i] == a is true. Thus, the
argument of this function (like x == a) must be a variable of mode logical.

x 

##  [1]  1  5  2  1  10  40  3 

which(x==2) 

##  [1]  3 

na.omit(x) suppresses the observations with missing data (NA). It suppresses the
corresponding row if x is a matrix or a data frame. na.fail(x) returns an error message
if x contains at least one NA.


df <- data.frame(a=1:5, b=c(1, 3, NA, 9, 8)); df

##   a  b
## 1 1  1
## 2 2  3
## 3 3 NA
## 4 4  9
## 5 5  8

na.omit(df)

##   a b
## 1 1 1
## 2 2 3
## 4 4 9
## 5 5 8

unique(x): if x is a vector or a data frame, it returns a similar object but with the
duplicate elements suppressed.

df1 <- data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
df1

##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

unique(df1)

##   a  b
## 1 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

table(x) returns a table with the different values of x and their frequencies
(typically for integers or factors). Also check prop.table().

v <- c(1, 2, 4, 2, 2, 5, 6, 4, 7, 8, 8)
table(v)

## v
## 1 2 4 5 6 7 8
## 1 3 2 1 1 1 2

subset(x, ...) returns a selection of x with respect to criteria ... (typically ...
are comparisons like x$V1 < 10). If x is a data frame, the option select = gives
the variables to be kept or dropped using a minus sign.

sub <- subset(df1, df1$a>5); sub

##   a  b
## 3 7 NA
## 4 6  9
## 5 8  8

sub <- subset(df1, select=-a)
sub

##    b
## 1  1
## 2  1
## 3 NA
## 4  9
## 5  8


sample(x, size) randomly samples size elements from the vector x without
replacement; the option replace = TRUE allows sampling with replacement.

v

## [1] 1 2 4 2 2 5 6 4 7 8 8

sample(df1$a, 20, replace = T)

## [1] 7 8 1 6 1 1 7 8 1 7 8 1 6 7 8 7 1 6 8 8

prop.table(x, margin=) expresses table entries as fractions of the marginal table.

prop.table(table(v))

##  v 

##  1  2  4  5  6  7 

##  0.09090909  0.27272727  0.18181818  0.09090909  0.09090909  0.09090909 
##  8 
##  0.18181818 


2.11  Math  Functions 


Basic math functions like sin, cos, tan, asin, acos, atan, atan2, log,
log10, exp, and "set" functions union(x, y), intersect(x, y),
setdiff(x, y), setequal(x, y), is.element(el, set) are available
in R.

lsf.str("package:base") displays all functions built into a specific R
package (like base).

Table 2.4 lists additional functions that you might need when using R for
calculations.

Note: many math functions have a logical parameter na.rm = FALSE to
specify missing data (NA) removal.
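The effect of na.rm and of the set functions can be seen on a few made-up vectors. This is a minimal illustration; the values are not taken from the text.

```r
x <- c(2, NA, 4, 6)
m_na <- mean(x)                  # NA, because one value is missing
m_ok <- mean(x, na.rm = TRUE)    # 4, computed from the three observed values
s_ok <- sum(x, na.rm = TRUE)     # 12

u <- union(c(1, 2, 3), c(3, 4))        # 1 2 3 4
i <- intersect(c(1, 2, 3), c(3, 4))    # 3
d <- setdiff(c(1, 2, 3), c(3, 4))      # 1 2
e <- is.element(3, c(3, 4))            # TRUE

print(c(m_na, m_ok, s_ok))
print(u); print(i); print(d); print(e)
```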


Table 2.4 Common mathematics, statistics, and processing R functions

Expression            Explanation
choose(n, k)          Computes the number of combinations of k events among n objects.
                      Mathematically it equals n!/[(n-k)! k!]
max(x)                Maximum of the elements of x
min(x)                Minimum of the elements of x
range(x)              Minimum and maximum of the elements of x
sum(x)                Sum of the elements of x
diff(x)               Lagged and iterated differences of vector x
prod(x)               Product of the elements of x
mean(x)               Mean of the elements of x
median(x)             Median of the elements of x
quantile(x, probs=)   Sample quantiles corresponding to the given probabilities
                      (defaults to 0, 0.25, 0.5, 0.75, 1)
weighted.mean(x, w)   Mean of x with weights w
rank(x)               Ranks of the elements of x
var(x) or cov(x)      Variance of the elements of x (calculated on n - 1). If x is a
                      matrix or a data frame, the variance-covariance matrix is calculated
sd(x)                 Standard deviation of x
cor(x)                Correlation matrix of x if it is a matrix or a data frame
                      (1 if x is a vector)
var(x, y) or          Covariance between x and y, or between the columns of x and
cov(x, y)             those of y if they are matrices or data frames
cor(x, y)             Linear correlation between x and y, or correlation matrix if they
                      are matrices or data frames
round(x, n)           Rounds the elements of x to n decimals
log(x, base)          Computes the logarithm of x with base "base"
scale(x)              If x is a matrix, centers and scales the data. Without centering use
                      the option center = FALSE. Without scaling use scale = FALSE
                      (by default center = TRUE, scale = TRUE)
pmin(x, y, ...)       A vector whose ith element is the minimum of x[i], y[i], ...
pmax(x, y, ...)       A vector whose ith element is the maximum of x[i], y[i], ...
cumsum(x)             A vector whose ith element is the sum from x[1] to x[i]
cumprod(x)            Same, for the product
cummin(x)             Same, for the minimum
cummax(x)             Same, for the maximum
Re(x)                 Real part of a complex number
Im(x)                 Imaginary part of a complex number
Mod(x)                Modulus. abs(x) is the same
Arg(x)                Angle in radians of the complex number
Conj(x)               Complex conjugate
convolve(x, y)        Computes several kinds of convolutions of two sequences
fft(x)                Fast Fourier Transform of an array
mvfft(x)              FFT of each column of a matrix
filter(x, filter)     Applies linear filtering to a univariate time series or to each
                      series separately of a multivariate time series
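Several of the functions in Table 2.4 are easiest to understand on a short vector. The sketch below uses a made-up vector x; the comments give the values each call produces for this input.

```r
x <- c(3, 1, 4, 1, 5)

cs <- cumsum(x)     # running sum:      3 4 8 9 14
cm <- cummax(x)     # running maximum:  3 3 4 4 5
dx <- diff(x)       # lagged differences: -2 3 -3 4
pm <- pmin(c(1, 5, 3), c(4, 2, 6))   # element-wise minimum: 1 2 3
ch <- choose(5, 2)  # 5!/(3! 2!) = 10
q  <- quantile(x)   # 0%, 25%, 50%, 75%, 100% sample quantiles

# var() divides by n - 1, not n
v_manual <- sum((x - mean(x))^2) / (length(x) - 1)

print(cs); print(cm); print(dx); print(pm); print(ch); print(q)
print(c(var(x), v_manual))
```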

2.12 Matrix Operations

Table 2.5 summarizes basic matrix operation functions. We will discuss this topic
in detail in Chap. 5.

mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)

mat1.identity <- mat1.inv %*% mat1
mat1.identity

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

b <- c(1, 2)
x <- solve(mat1, b)
x

## [1] 1.785714 2.357143
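A simple sanity check on a linear solve is to substitute the solution back into the system. The sketch below repeats the definitions above and compares with all.equal(), which tolerates floating-point rounding; the names ok_solution and ok_inverse are introduced here for illustration.

```r
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
b <- c(1, 2)
x <- solve(mat1, b)                 # solves mat1 %*% x = b

# Substituting x back should reproduce b (up to rounding error).
ok_solution <- isTRUE(all.equal(as.vector(mat1 %*% x), b))

# The inverse times the matrix should be the 2 x 2 identity, diag(2).
ok_inverse <- isTRUE(all.equal(solve(mat1) %*% mat1, diag(2)))

print(x)
print(c(ok_solution, ok_inverse))
```

For this system the exact solution is (25/14, 33/14), which matches the printed values 1.785714 and 2.357143.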


2.13  Advanced  Data  Processing 

In this section, we will introduce some convenient functions that can save a
remarkable amount of time.

apply(X, MARGIN, FUN=) returns a vector, array, or list of values obtained by applying
a function FUN to margins (MARGIN = 1 means rows, MARGIN = 2 means columns)
of X.


Table 2.5 Common R operators

Expression      Explanation
t(x)            Transpose
diag(x)         Diagonal
%*%             Matrix multiplication
solve(a, b)     Solves a %*% x = b for x
solve(a)        Matrix inverse of a
rowsum(x)       Sum of rows for a matrix-like object. rowSums(x) is a faster version
colSums(x)      Same, for columns
rowMeans(x)     Fast version of row means
colMeans(x)     Same, for columns

df1

##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

apply(df1, 2, mean, na.rm=T)

##    a    b
## 4.60 4.75

Note that we can add options for the FUN after the function.

lapply(X, FUN) applies FUN to each member of the list X. If X is a data frame,
it will apply FUN to each column and return a list.

lapply(df1, mean, na.rm=T)

## $a
## [1] 4.6
##
## $b
## [1] 4.75

lapply(list(a=c(1, 23, 5, 6, 1), b=c(9, 90, 999)), median)

## $a
## [1] 5
##
## $b
## [1] 90
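The difference between row-wise and column-wise application is easiest to see on a small matrix. The following sketch uses made-up data and an anonymous function as FUN; the object names are illustrative.

```r
m <- matrix(1:6, nrow = 2)       # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)     # one value per row:    9 12
col_max  <- apply(m, 2, max)     # one value per column: 2 4 6
row_rng  <- apply(m, 1, function(r) max(r) - min(r))   # custom function: 4 4

# lapply() always returns a list, one entry per input element.
lens <- lapply(list(a = 1:3, b = letters), length)

print(row_sums); print(col_max); print(row_rng)
print(lens)
```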

tapply(X, INDEX, FUN=) applies FUN to each cell of a ragged array given by X
with indexes equal to INDEX. Note that X is an atomic object, typically a vector.

v

## [1] 1 2 4 2 2 5 6 4 7 8 8

fac <- factor(rep(1:3, length = 11), levels = 1:3)
table(fac)

## fac
## 1 2 3
## 4 4 3

tapply(v, fac, sum)

##  1  2  3
## 17 16 16

by(data, INDEX, FUN) applies FUN to data frame data subsetted by INDEX.

by(df1, df1[, 1], sum)

## df1[, 1]: 1
## [1] 4
## --------------------------------------------------------
## df1[, 1]: 6
## [1] 15
## --------------------------------------------------------
## df1[, 1]: 7
## [1] NA
## --------------------------------------------------------
## df1[, 1]: 8
## [1] 16

This code applies the sum function to df1 using column 1 as an index.

merge(a, b) merges two data frames by common columns or row names. We can
use the option by = to specify the index column.

df2 <- data.frame(a=c(1, 1, 7, 6, 8), c=1:5)
df2

##   a c
## 1 1 1
## 2 1 2
## 3 7 3
## 4 6 4
## 5 8 5

df3 <- merge(df1, df2, by="a")
df3

##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

xtabs(a ~ b, data = x) creates a contingency table from cross-classifying factors.

DF <- as.data.frame(UCBAdmissions)

## 'DF' is a data frame with a grid of the factors and the counts
## in variable 'Freq'.

DF

##       Admit Gender Dept Freq
## 1  Admitted   Male    A  512
## 2  Rejected   Male    A  313
## 3  Admitted Female    A   89
## ...
## 23 Admitted Female    F   24
## 24 Rejected Female    F  317

## Nice for taking margins ...
xtabs(Freq ~ Gender + Admit, DF)

##         Admit
## Gender   Admitted Rejected
##   Male       1198     1493
##   Female      557     1278

## And for testing independence ...
summary(xtabs(Freq ~ ., DF))

## Call: xtabs(formula = Freq ~ ., data = DF)
## Number of cases in table: 4526
## Number of factors: 3
## Test for independence of all factors:
##  Chisq = 2000.3, df = 16, p-value = 0

aggregate(x, by, FUN) splits the data frame x into subsets, computes summary
statistics for each, and returns the result in a convenient form. by is a list of grouping
elements, each of which has the same length as the variables in x.

list(rep(1:3, length=7))

## [[1]]
## [1] 1 2 3 1 2 3 1

aggregate(df3, by=list(rep(1:3, length=7)), sum)

##   Group.1  a  b c
## 1       1 10 10 8
## 2       2  7 10 6
## 3       3  8 NA 4

The above code applied the function sum to data frame df3 according to the
index created by list(rep(1:3, length=7)).
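Naming the grouping element makes the output easier to read than the default Group.1. The following is a minimal sketch with a made-up data frame; scores, group, and value are hypothetical names.

```r
scores <- data.frame(group = c("a", "b", "a", "b"), value = c(1, 2, 3, 4))

# The name given inside list() becomes the name of the grouping column;
# the aggregated values are returned in a column called x.
totals <- aggregate(scores$value, by = list(group = scores$group), FUN = sum)

print(totals)
```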

stack(x, ...) transforms data, stored as separate columns in a data frame or a list,
into a single column, and unstack(x, ...) is the inverse of stack().

stack(df3)

##    values ind
## 1       1   a
## 2       1   a
## 3       1   a
## ...
## 20      3   c
## 21      5   c

unstack(stack(df3))

##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

reshape(x, ...) reshapes a data frame between "wide" format with repeated
measurements in separate columns of the same record and "long" format with the
repeated measurements in separate records. Use direction = "wide" or
direction = "long".

df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
    time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))
wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
wide

##    school class    score.1      score.2
## 1       1     9 -0.1575202 -1.415503816
## 2       1    10  0.5804452  1.754559537
## 5       2     9  0.1553872  1.693809827
## 6       2    10 -0.7540783  0.478035367
## 9       3     9 -0.6490757 -0.002922609
## 10      3    10 -0.2122064  0.276259031

long <- reshape(wide, idvar = c("school", "class"), direction = "long")
long

##        school class time      score.1
## 1.9.1       1     9    1 -0.157520208
## 1.10.1      1    10    1  0.580445243
## 2.9.1       2     9    1  0.155387189
## 2.10.1      2    10    1 -0.754078345
## 3.9.1       3     9    1 -0.649075721
## 3.10.1      3    10    1 -0.212206430
## 1.9.2       1     9    2 -1.415503816
## 1.10.2      1    10    2  1.754559537
## 2.9.2       2     9    2  1.693809827
## 2.10.2      2    10    2  0.478035367
## 3.9.2       3     9    2 -0.002922609
## 3.10.2      3    10    2  0.276259031

Notes

• The x in this function has to be longitudinal data.

• The call to rnorm used to create the reshaped data might generate different results
for each call, unless set.seed(1234) is used to ensure reproducibility of
random-number generation.
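The reproducibility point can be sketched as follows: fixing the seed before the call to rnorm makes the simulated scores, and hence the reshaped tables, identical across runs. The seed value 1234 is the one mentioned in the note; the check variable same_draws is introduced here for illustration.

```r
set.seed(1234)   # fix the random-number stream before simulating scores
df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
                  time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))

wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
long <- reshape(wide, idvar = c("school", "class"), direction = "long")

# Re-seeding reproduces exactly the same 12 draws.
set.seed(1234)
same_draws <- identical(df4$score, rnorm(12))

print(dim(wide))    # one row per (school, class) pair
print(dim(long))    # one row per (school, class, time) triple
print(same_draws)
```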


2.14  Strings 

The  following  functions  are  useful  for  handling  strings  in  R. 

paste(...) concatenates vectors after converting to character. It has a few options:
sep = is the string to separate terms (a single space is the default), and collapse = is
an optional string to separate "collapsed" results.

a <- "today"
b <- "is a good day"
paste(a, b)

## [1] "today is a good day"

paste(a, b, sep=", ")

## [1] "today, is a good day"

substr(x, start, stop) extracts substrings in a character vector. It can also assign values
(with the same length) to part of a string, as substr(x, start, stop) <- value.

a <- "When the going gets tough, the tough get going!"
substr(a, 10, 40)

## [1] "going gets tough, the tough get"

substr(a, 1, 9) <- "........."
a

## [1] ".........going gets tough, the tough get going!"

Note that characters at start and stop indexes are inclusive in the output.

strsplit(x, split) splits x according to the substring split. Use fixed = TRUE for
non-regular expressions.

strsplit("a.b.c", ".", fixed = TRUE)

## [[1]]
## [1] "a" "b" "c"

grep(pattern, x) searches for matches to pattern within x. It will return a vector of
the indices of the elements of x that yielded a match. Use a regular expression for
pattern (unless fixed = TRUE). See ?regex for details.

letters

## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

grep("[a-z]", letters)

## [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26

gsub(pattern, replacement, x) replaces matches determined by regular
expression matching. sub() is the same but only replaces the first occurrence.

a <- c("e", 0, "kj", 10, ";")
gsub("[a-z]", "letters", a)

## [1] "letters"        "0"              "lettersletters" "10"
## [5] ";"

sub("[a-z]", "letters", a)

## [1] "letters"  "0"        "lettersj" "10"       ";"
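The following sketch contrasts a few of these string functions on made-up inputs. It also uses grepl(), a base R relative of grep() that returns a logical vector instead of indices, and shows that a literal dot can be matched either with fixed = TRUE or with an escaped regular expression.

```r
words <- c("apple", "banana", "cherry")

idx   <- grep("an", words)            # index of the matching element: 2
has_a <- grepl("a", words)            # TRUE TRUE FALSE
first <- sub("a", "A", "banana")      # only the first match: "bAnana"
all_a <- gsub("a", "A", "banana")     # every match:          "bAnAnA"

# In a regular expression "." matches any character, so a literal dot
# needs either fixed = TRUE or an escaped pattern.
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
parts_regex <- strsplit("a.b.c", "\\.")[[1]]

print(idx); print(has_a); print(first); print(all_a)
print(parts_fixed); print(parts_regex)
```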

tolower(x) converts to lowercase, toupper(x) converts to uppercase.
match(x, table) a vector of the positions of first matches for the elements of x
among table, with a shorthand x %in% table, which returns a logical vector.

x <- c(1, 2, 10, 19, 29)
match(x, c(1, 10))

## [1]  1 NA  2 NA NA

x %in% c(1, 10)

## [1]  TRUE FALSE  TRUE FALSE FALSE
pmatch(x, table) partial matches for the elements of x among table.

pmatch("m", c("mean", "median", "mode")) # returns NA
## [1] NA

pmatch("med", c("mean", "median", "mode")) # returns 2
## [1] 2

The first one returns NA, and dependent on the R-version, possibly a warning,
because all elements begin with the pattern "m", so the partial match is ambiguous.
nchar(x)  number  of  characters  in  x. 
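
The string utilities above combine naturally in practice. The following is a minimal sketch; the vector ids and its contents are invented purely for illustration and do not come from any dataset used in this book.

```r
# Hypothetical sample identifiers, invented for this illustration
ids <- c("PD_001_baseline", "AD_017_followup", "PD_023_baseline")

# Split each identifier on the literal underscore
parts <- strsplit(ids, "_", fixed = TRUE)

# Extract the first field (diagnosis code) from every element
dx <- sapply(parts, `[`, 1)

# Indices of identifiers that start with "PD"
pd_idx <- grep("^PD", ids)

# Replace every run of digits with "#" and convert to uppercase
masked <- toupper(gsub("[0-9]+", "#", ids))

# Number of characters in each identifier
len <- nchar(ids)
```

Here fixed = TRUE treats the split string literally, whereas grep() and gsub() interpret their patterns as regular expressions.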

Dates  and  Times 

The class Date has dates without times. POSIXct has dates and times, including
time zones. Comparisons (e.g., >), seq(), and difftime() are useful.
?DateTimeClasses gives more information. See also package chron.

as.Date(s) and as.POSIXct(s) convert to the respective class; format(dt)
converts to a string representation. The default string format is 2001-02-21.




These accept a second argument to specify a format for conversion. Some common
formats are (Table 2.6):

Table 2.6 R date formatting specifications

Formats             Explanations
%a, %A              Abbreviated and full weekday name.
%b, %B              Abbreviated and full month name.
%d                  Day of the month (01 ... 31).
%H                  Hours (00 ... 23).
%I                  Hours (01 ... 12).
%j                  Day of year (001 ... 366).
%m                  Month (01 ... 12).
%M                  Minute (00 ... 59).
%p                  AM/PM indicator.
%S                  Second as decimal number (00 ... 61).
%U                  Week (00 ... 53); the first Sunday as day 1 of week 1.
%w                  Weekday (0 ... 6, Sunday is 0).
%W                  Week (00 ... 53); the first Monday as day 1 of week 1.
%y                  Year without century (00 ... 99). Don't use.
%Y                  Year with century.
%z (output only)    Offset from Greenwich; -0800 is 8 hours west of Greenwich Meridian.
%Z (output only)    Time zone as a character string (empty if not available).

Where leading zeros are shown they will be used on output but are optional on
input. See ?strftime for details.
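
As a brief illustration of these conversions, consider the sketch below. The specific dates are arbitrary, and only numeric format codes are used so that the results do not depend on the locale.

```r
# Parse a date given in month/day/year order
d <- as.Date("02/21/2001", format = "%m/%d/%Y")

# Default string representation is year-month-day
format(d)               # "2001-02-21"

# Numeric fields avoid locale-dependent month and weekday names
format(d, "%d.%m.%Y")   # "21.02.2001"
format(d, "%j")         # day of the year, "052"

# Date arithmetic: differences are difftime objects
d2 <- as.Date("2001-03-01")
gap <- difftime(d2, d, units = "days")
as.numeric(gap)         # 8

# A weekly sequence of four dates starting at d
wk <- seq(d, by = "week", length.out = 4)

# POSIXct carries the time of day and a time zone
t1 <- as.POSIXct("2001-02-21 14:30:00", tz = "UTC")
format(t1, "%H:%M")     # "14:30"
```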


2.15  Plotting 

This  is  only  an  introduction  for  plotting  functions  in  R.  In  Chap.  4,  we  will  discuss 
visualization  in  more  detail. 

plot(x)  plot  of  the  values  of  x  (on  the  y-axis)  ordered  on  the  x-axis. 
plot(x,  y)  bivariate  plot  of  x  (on  the  x-axis)  and  y  (on  the  y-axis). 
hist(x)  histogram  of  the  frequencies  of  x. 

barplot(x) bar chart of the values of x. Use horiz = TRUE for horizontal
bars.

dotchart(x)  if  x  is  a  data  frame,  plots  a  Cleveland  dot  plot  (stacked  plots  line-by¬ 
line  and  column-by-column). 
pie(x)  circular  pie-chart. 
boxplot(x)  ‘box-and-whiskers’  plot. 

sunflowerplot(x, y) same as plot() but the points with similar coordinates are
drawn as flowers whose petal number represents the number of points.

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for
small sample sizes); stripplot() is the lattice counterpart.

coplot(x ~ y | z) bivariate plot of x and y for each value or interval of values of z.
interaction.plot(f1, f2, y) if f1 and f2 are factors, plots the means of y (on the
y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves).
The option fun allows choosing the summary statistic of y (by default
fun = mean).

matplot(x,  y)  bivariate  plot  of  the  first  column  of  x  vs.  the  first  one  of  y,  the 
second  one  of  x  vs.  the  second  one  of  y,  etc. 

fourfoldplot(x) visualizes, with quarters of circles, the association between two
dichotomous variables for different populations (x must be an array with
dim = c(2, 2, k), or a matrix with dim = c(2, 2) if k = 1).

assocplot(x) Cohen-Friendly graph showing the deviations from independence of
rows and columns in a two-dimensional contingency table.

mosaicplot(x) "mosaic" graph of the residuals from a log-linear regression of a
contingency table.

pairs(x)  if  x  is  a  matrix  or  a  data  frame,  draws  all  possible  bivariate  plots  between 
the  columns  of  x. 

plot.ts(x) if x is an object of class "ts", it plots x with respect to time; x may be
multivariate but the series must have the same frequency and dates. Detailed
examples are in Chap. 19: Big Longitudinal Data Analysis.

ts.plot(x) same as plot.ts() but if x is multivariate the series may have different
dates and must have the same frequency.

qqnorm(x) quantiles of x with respect to the values expected under a normal law.
qqplot(x, y) quantiles of y with respect to the quantiles of x.
contour(x, y, z) contour plot (data are interpolated to draw the curves); x and y
must be vectors and z must be a matrix so that dim(z) = c(length(x),
length(y)) (x and y may be omitted).

filled.contour(x,  y,  z)  areas  between  the  contours  are  colored,  and  a  legend  of  the 
colors  is  drawn  as  well. 

image(x,  y,  z)  plotting  actual  data  with  colors. 
persp(x,  y,  z)  plotting  actual  data  in  perspective  view. 

stars(x)  if  x  is  a  matrix  or  a  data  frame,  draws  a  graph  with  segments  or  a  star 
where  each  row  of  x  is  represented  by  a  star  and  the  columns  are  the  lengths  of  the 
segments. 

symbols(x, y, ...) draws, at the coordinates given by x and y, symbols (circles,
squares, rectangles, stars, thermometers or "boxplots") whose sizes, colors, etc. are
specified by supplementary arguments.

termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj).
The  following  parameters  are  common  to  many  plotting  functions  (Table  2.7): 


Table 2.7 Basic R plotting parameters

Parameters        Explanations
add = FALSE       If TRUE, superposes the plot on the previous one (if it exists)
axes = TRUE       If FALSE, does not draw the axes and the box
type = "p"        Specifies the type of plot: "p": points, "l": lines, "b": points connected by lines, "o": same as "b" but the lines are over the points, "h": vertical lines, "s": steps, the data are represented by the top of the vertical lines, "S": same as "s" but the data are represented by the bottom of the vertical lines
xlim =, ylim =    Specifies the lower and upper limits of the axes, for example with xlim = c(1, 10) or xlim = range(x)
xlab =, ylab =    Annotates the axes, must be variables of mode character
main =            Main title, must be a variable of mode character
sub =             Subtitle (written in a smaller font)

2.16  QQ  Normal  Probability  Plot 

Let's look at one simple example, a quantile-quantile (Q-Q) probability plot. Suppose
X ~ N(0, 1) and Y ~ Cauchy represent the observed/raw and simulated/synthetic
data for one feature (variable) in the data (Figs. 2.4, 2.5, 2.6 and 2.7).
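
Before plotting, it can help to compare the theoretical quantiles of the two distributions numerically, since the Cauchy distribution has much heavier tails than the normal. The short sketch below does this; the probability grid p is an arbitrary choice made for illustration.

```r
# Probabilities at which to compare the two distributions
p <- c(0.75, 0.90, 0.975, 0.999)

# Theoretical quantiles of the standard normal and standard Cauchy
q_norm   <- qnorm(p)
q_cauchy <- qcauchy(p)

# Side-by-side comparison: Cauchy quantiles grow much faster in the tail
round(cbind(p, q_norm, q_cauchy), 3)
```

Because a few extreme Cauchy observations can be orders of magnitude larger than the bulk of the sample, an unrestricted axis would compress the central part of a Q-Q plot into a nearly flat line, which is one reason to restrict the plotted range.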


X <- rnorm(1000)
Y <- rcauchy(1500)

# compare X to StdNormal distribution
qqnorm(X,
       main = "Normal Q-Q Plot of the data",
       xlab = "Theoretical Quantiles of the Normal",
       ylab = "Sample Quantiles of the X (Normal) Data")
qqline(X)

qqplot(X, Y)

# Y against StdNormal
qqnorm(Y,
       main = "Normal Q-Q Plot of the data",
       xlab = "Theoretical Quantiles of the Normal",
       ylab = "Sample Quantiles of the Y (Cauchy) Data", ylim = range(-4, 4))
# Why is the y-range specified here?
qqline(Y)

# Q-Q plot data (X) vs. simulation (Y)
myQQ <- function(x, y, ...) {
  # rang <- range(x, na.rm = TRUE)
  rang <- range(-4, 4, na.rm = TRUE)
  qqplot(x, y, xlim = rang, ylim = rang)
}

myQQ(X, Y) # where the Y is the newly simulated data for X
qqline(X)

Fig. 2.4 Quantile-quantile plot comparing the sample distribution to normal distribution

Fig. 2.5 Comparing a Cauchy sample distribution to Normal sample distribution via Q-Q plot

Fig. 2.6 Comparing Cauchy sample to Normal quantiles

Fig. 2.7 Using isotropic scales to compare Cauchy sample to Normal quantiles


# Subsampling
x <- matrix(rnorm(100), ncol = 5)
y <- c(1, seq(19))
z <- cbind(x, y)
z.df <- data.frame(z)
z.df

##            V1         V2         V3           V4         V5  y
##  1 -0.5202336  0.5695642 -0.8104910 -0.775492348  1.8310536  1
##  2 -1.4370163 -3.0437691 -0.4895970 -0.018963095  2.2980451  1
##  3  1.1510882 -1.5345341 -0.5443915  1.176473324 -0.9079013  2
##  4  0.2937683 -1.1738992  1.1329062  0.050817201 -0.1975722  3
##  5  0.1011329  1.1382172 -0.3353099  1.980538873 -1.4902878  4
##  6 -0.3842767  1.7629568 -0.1734520  0.009448173  0.4166688  5
##  7 -0.1897151 -0.2928122  0.9917801  0.147767309 -0.3447306  6
##  8 -1.5184068 -0.6339424 -1.4102368  0.471592965  1.0748895  7
##  9 -0.6475764  0.3884220  1.5151532 -1.977356193 -0.9561620  8
## 10  0.1476949 -0.2219758  0.6255156 -0.755406330 -0.3411347  9
## 11  1.1927071 -0.2031697  0.6926743  1.263878207 -0.2628487 10
## 12  0.6117842 -0.3206093 -1.0544746  0.074048308 -0.3483535 11
## 13  1.7865743 -0.9457715 -0.2907310  1.520606318  2.3182403 12
## 14 -0.2075467  0.6440087  0.6277978 -1.670570757  0.1356807 13
## 15  0.2087459  1.2049360  1.2614003  1.102632278  0.4413631 14
## 16 -0.8663415 -0.4149625  1.3974565  0.432508163 -0.7408295 15
## 17 -0.4808447  0.6163081 -0.8693709 -0.830734957 -0.2094428 16
## 18 -0.3456697  2.5622196 -0.9398627  0.363765941 -1.4032376 17
## 19  1.1240451 -0.1887518 -0.6514363 -0.988661412 -1.2906608 18
## 20 -0.9783920  1.0246003 -0.6001832 -0.568181332  0.2374808 19

names(z.df)

## [1] "V1" "V2" "V3" "V4" "V5" "y"

# subsetting rows
z.sub <- subset(z.df, y > 2 & (y < 10 | V1 > 0))
z.sub

##            V1         V2         V3           V4         V5  y
##  4  0.2937683 -1.1738992  1.1329062  0.050817201 -0.1975722  3
##  5  0.1011329  1.1382172 -0.3353099  1.980538873 -1.4902878  4
##  6 -0.3842767  1.7629568 -0.1734520  0.009448173  0.4166688  5
##  7 -0.1897151 -0.2928122  0.9917801  0.147767309 -0.3447306  6
##  8 -1.5184068 -0.6339424 -1.4102368  0.471592965  1.0748895  7
##  9 -0.6475764  0.3884220  1.5151532 -1.977356193 -0.9561620  8
## 10  0.1476949 -0.2219758  0.6255156 -0.755406330 -0.3411347  9
## 11  1.1927071 -0.2031697  0.6926743  1.263878207 -0.2628487 10
## 12  0.6117842 -0.3206093 -1.0544746  0.074048308 -0.3483535 11
## 13  1.7865743 -0.9457715 -0.2907310  1.520606318  2.3182403 12
## 15  0.2087459  1.2049360  1.2614003  1.102632278  0.4413631 14
## 19  1.1240451 -0.1887518 -0.6514363 -0.988661412 -1.2906608 18

z.sub1 <- z.df[z.df$y == 1, ]
z.sub1

##           V1         V2        V3          V4       V5 y
## 1 -0.5202336  0.5695642 -0.810491 -0.77549235 1.831054 1
## 2 -1.4370163 -3.0437691 -0.489597 -0.01896309 2.298045 1

z.sub2 <- z.df[z.df$y %in% c(1, 4), ]
z.sub2

##           V1         V2         V3          V4        V5 y
## 1 -0.5202336  0.5695642 -0.8104910 -0.77549235  1.831054 1
## 2 -1.4370163 -3.0437691 -0.4895970 -0.01896309  2.298045 1
## 5  0.1011329  1.1382172 -0.3353099  1.98053887 -1.490288 4

# subsetting columns
z.sub6 <- z.df[, 1:2]
z.sub6

##            V1         V2
##  1 -0.5202336  0.5695642
##  2 -1.4370163 -3.0437691
##  3  1.1510882 -1.5345341
##  4  0.2937683 -1.1738992
##  5  0.1011329  1.1382172
##  6 -0.3842767  1.7629568
##  7 -0.1897151 -0.2928122
##  8 -1.5184068 -0.6339424
##  9 -0.6475764  0.3884220
## 10  0.1476949 -0.2219758
## 11  1.1927071 -0.2031697
## 12  0.6117842 -0.3206093
## 13  1.7865743 -0.9457715
## 14 -0.2075467  0.6440087
## 15  0.2087459  1.2049360
## 16 -0.8663415 -0.4149625
## 17 -0.4808447  0.6163081
## 18 -0.3456697  2.5622196
## 19  1.1240451 -0.1887518
## 20 -0.9783920  1.0246003
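
Because the subsampling output above depends on random draws, it may be easier to follow the same operations on a small deterministic data frame. The sketch below uses invented values; the object names df and s1 to s4 are introduced only for this illustration.

```r
# A small deterministic data frame (values invented for illustration)
df <- data.frame(id = 1:6,
                 score = c(10, 25, 7, 31, 18, 25),
                 group = c("a", "b", "a", "c", "b", "c"))

# Row subsetting with subset(): column names are used directly
s1 <- subset(df, score > 15 & group != "c")

# Equivalent bracket indexing: logical row selector, all columns
s2 <- df[df$score > 15 & df$group != "c", ]

# Membership test with %in%, keeping only two columns by name
s3 <- df[df$group %in% c("a", "c"), c("id", "score")]

# subset() can select rows and columns in one call
s4 <- subset(df, group == "b", select = c(id, score))
```

Note that subset() silently drops rows where the condition evaluates to NA, whereas bracket indexing with a logical vector containing NA returns rows filled with NA, so the two forms agree only when the condition has no missing values.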

2.17 Low-Level Plotting Commands

points(x,  y)  adds  points  (the  option  type  =  can  be  used). 
lines(x,  y)  a  line  plot  of  (x,y)  pairs. 

text(x, y, labels, ...) adds text given by labels at coordinates (x, y). Typical use:
plot(x, y, type = "n"); text(x, y, names).

mtext(text, side = 3, line = 0, ...) adds text given by text in the margin specified
by side (see axis() below); line specifies the line from the plotting area.

segments(x0, y0, x1, y1) draws lines from points (x0, y0) to points (x1, y1).

arrows(x0, y0, x1, y1, angle = 30, code = 2) draws arrows between pairs of
points, with the arrowhead at (x0, y0) if code = 1, at (x1, y1) if code = 2,
and at both ends if code = 3. The angle argument controls the angle from the shaft
of the arrow to the edge of the arrowhead.

abline(a,  b)  draws  a  line  of  slope  b  and  intercept  a. 
abline(h  =  y)  draws  a  horizontal  line  at  ordinate  y. 
abline(v  =  x)  draws  a  vertical  line  at  abscissa  x. 

abline(lm.obj) draws the regression line given by lm.obj. The color argument
(col) is often used, e.g., abline(h = 0, col = 2).

rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are
x1, x2, y1, and y2, respectively.

polygon(x,  y)  draws  a  polygon  linking  the  points  with  coordinates  given  by  x 
and  y. 

legend(x,  y,  legend)  adds  the  legend  at  the  point  (x ,  y )  with  the  symbols  given 
by  legend. 

title()  adds  a  plot  title  and  optionally  a  subtitle. 

axis(side,  vect)  adds  an  axis  at  the  bottom  (side  =  1),  on  the  left  (side  =  2),  at 
the  top  (side  =  3),  or  on  the  right  (side  =  4);  vect  (optional)  gives  the  abscissa 
(or  ordinates)  where  tick-marks  are  drawn. 

rug(x)  draws  the  data  x  on  the  x-axis  as  small  vertical  lines. 
locator(n, type = "n", ...) returns the coordinates (x, y) after the user has
clicked n times on the plot with the mouse; also draws symbols (type = "p") or
lines (type = "l") with respect to optional graphic parameters (...); by default
nothing is drawn (type = "n").
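
The low-level commands above are typically layered on top of an existing high-level plot. The following sketch builds one figure step by step; the simulated data, seed, and annotation text are arbitrary choices for illustration.

```r
# Simulated data (seed and coefficients are arbitrary choices)
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)

# High-level call sets up the axes without drawing any points
plot(x, y, type = "n", main = "Low-level plotting sketch")

# Low-level calls add elements to the existing plot
points(x, y, pch = 19, col = "blue")          # observed values
fit <- lm(y ~ x)
abline(fit, col = "red", lwd = 2)             # fitted regression line
abline(h = mean(y), lty = 2)                  # horizontal reference line
segments(x, y, x, fitted(fit), col = "gray")  # residuals as vertical segments
text(5, max(y), "residuals in gray", pos = 4) # annotation to the right of (5, max(y))
rug(x)                                        # tick marks for the x values
legend("bottomright", legend = c("data", "fit"),
       col = c("blue", "red"), pch = c(19, NA), lty = c(NA, 1))
```

Starting with type = "n" is a convenient way to fix the coordinate system first and then control the drawing order of every element explicitly.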


2.18  Graphics  Parameters 

These can be set globally with par(...). Many can be passed as parameters to plotting
commands (Table 2.8).

adj controls text justification (adj = 0 left-justified, adj = 0.5 centered,
adj = 1 right-justified).




Table 2.8 Common plotting functions for displaying variable relationships subject to conditioning
(trellis plots) available in the R lattice package

Expression                       Explanation
xyplot(y~x)                      Bivariate plots (with many functionalities)
barchart(y~x)                    Histogram of the values of y with respect to those of x
dotplot(y~x)                     Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x)                  Density functions plot
histogram(~x)                    Histogram of the frequencies of x
bwplot(y~x)                      "Box-and-whiskers" plot
qqmath(~x)                       Quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x)                   Single dimension plot, x must be numeric, y may be a factor
qq(y~x)                          Quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two "levels"
splom(~x)                        Matrix of bivariate plots
parallelplot(~x)                 Parallel coordinates plot (named parallel() in older versions of lattice)
levelplot(z ~ x*y | g1*g2)       Colored plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length)
wireframe(z ~ x*y | g1*g2)       3d surface plot
cloud(z ~ x*y | g1*g2)           3d scatter plot

bg specifies the color of the background (e.g., bg = "red", bg = "blue"; the
list of the 657 available colors is displayed with colors()).

bty controls the type of box drawn around the plot. Allowed values are: "o", "l",
"7", "c", "u" or "]" (the box looks like the corresponding character). If bty = "n"
the box is not drawn.

cex a value controlling the size of texts and symbols with respect to the default.
The following parameters have the same control for numbers on the axes (cex.axis),
the axis labels (cex.lab), the title (cex.main), and the subtitle (cex.sub).

col controls the color of symbols and lines. Use color names ("red", "blue"; see
colors()) or "#RRGGBB" strings; see rgb(), hsv(), gray(), and rainbow();
as for cex there are: col.axis, col.lab, col.main, col.sub.

font an integer which controls the style of text (1: normal, 2: bold, 3: italics, 4:
bold italics); as for cex there are: font.axis, font.lab, font.main, font.sub.

las an integer which controls the orientation of the axis labels (0: parallel to the
axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical).

lty controls the type of lines, can be an integer or string (1: "solid", 2: "dashed", 3:
"dotted", 4: "dotdash", 5: "longdash", 6: "twodash"), or a string of up to eight
characters (between "0" and "9") which specifies alternatively the length, in points
or pixels, of the drawn elements and the blanks; for example, lty = "44" will have
the same effect as lty = 2.

lwd a numeric which controls the width of lines, default = 1.

mar a vector of 4 numeric values which control the space between the axes and
the border of the graph of the form c(bottom, left, top, right); the default
values are c(5.1, 4.1, 4.1, 2.1).

mfcol a vector of the form c(nr, nc) which partitions the graphic window as a
matrix of nr lines and nc columns; the plots are then drawn in columns.

mfrow same as mfcol but the plots are drawn by row.

pch controls the type of symbol, either an integer between 0 and 25, or any single
character within "".

ps an integer which controls the size in points of texts and symbols.

pty a character, which specifies the type of the plotting region, "s": square, "m":
maximal.

tck a value which specifies the length of tick-marks on the axes as a fraction of the
smallest of the width or height of the plot; if tck = 1 a grid is drawn.

tcl a value which specifies the length of tick-marks on the axes as a fraction of the
height of a line of text (by default tcl = -0.5).

xaxt if xaxt = "n" the x-axis is set but not drawn (useful in conjunction with
axis(side = 1, ...)).

yaxt if yaxt = "n" the y-axis is set but not drawn (useful in conjunction with
axis(side = 2, ...)).
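
Since par() changes persist for the current graphics device, a common pattern is to save the previous settings and restore them afterwards. The sketch below illustrates this with a few of the parameters listed above; the particular values are arbitrary.

```r
# par() returns the previous values of the parameters it changes
op <- par(mfrow = c(1, 2),              # 1 row, 2 columns, filled by row
          mar = c(4.1, 4.1, 2.1, 1.1),  # bottom, left, top, right margins
          las = 1,                      # horizontal axis labels
          cex.main = 0.9)               # slightly smaller titles

x <- seq(0, 2 * pi, length.out = 50)
plot(x, sin(x), type = "l", lty = 2, lwd = 2, col = "blue", main = "sin")

# Suppress the default x-axis, then draw a custom one with axis()
plot(x, cos(x), type = "b", pch = 4, xaxt = "n", main = "cos")
axis(side = 1, at = c(0, pi, 2 * pi), labels = c("0", "pi", "2pi"))

# Restore the saved settings so later plots are unaffected
par(op)
```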

Lattice  (Trellis)  graphics. 

In the normal Lattice formula, y ~ x | g1*g2 has combinations of optional
conditioning variables g1 and g2 plotted on separate panels. Lattice functions take many
of the same arguments as base graphics plus also data = the data frame for the
formula variables and subset = for subsetting. Use panel = to define a custom
panel function (see apropos("panel") and ?lines). Lattice functions return
an object of class trellis and have to be printed to produce the graph. Use
print(xyplot(...)) inside functions where automatic printing doesn't work. Use
lattice.theme and lset to change Lattice defaults.
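
To make the printing behavior concrete, here is a minimal sketch using the built-in mtcars data. It assumes the lattice package is installed (it ships with standard R distributions); the helper draw_panels is a hypothetical name introduced only for this example.

```r
library(lattice)  # recommended package distributed with R

# Conditioned scatter plot: one panel per number of cylinders
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            subset = hp < 250,
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

class(p)   # "trellis"

# Inside a function the object must be printed explicitly
draw_panels <- function(obj) {
  print(obj)
  invisible(obj)
}
draw_panels(p)
```

At the console, typing p alone triggers automatic printing, which is why the explicit print() call is only needed inside functions, loops, or sourced scripts.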


2.19 Optimization and Model Fitting

optim(par, fn, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B",
"SANN")) general-purpose optimization; par is initial values, fn is function to
optimize (normally minimize).

nlm(f, p) minimize function f using a Newton-type algorithm with starting
values p.

lm(formula) fit linear models; formula is typically of the form response ~
termA + termB + ...; use I(x*y) + I(x^2) for terms made of nonlinear
components.

glm(formula, family=) fits generalized linear models, specified by giving a
symbolic description of the linear predictor and a description of the error
distribution; family is a description of the error distribution and link function to be
used in the model; see ?family.

nls(formula)  nonlinear  least-squares  estimates  of  the  nonlinear  model 
parameters. 

approx(x,  y=)  linearly  interpolate  given  data  points;  x  can  be  an  xy  plotting 
structure. 

spline(x,  y=)  cubic  spline  interpolation. 

loess(formula)  (locally  weighted  scatterplot  smoothing)  fit  a  polynomial  surface 
using  local  fitting. 

Many of the formula-based modeling functions have several common arguments:
data = the data frame for the formula variables, subset = a subset of observations
used in the fit, na.action = action for missing values: "na.fail", "na.omit",
or a function.

The following generics often apply to model fitting functions:

predict(fit, ...) predictions from fit based on input data.
df.residual(fit) returns the number of residual degrees of freedom.
coef(fit) returns the estimated coefficients (sometimes with their standard
errors).
residuals(fit) returns the residuals.
deviance(fit) returns the deviance.
fitted(fit) returns the fitted values.
logLik(fit) computes the logarithm of the likelihood and the number of
parameters.
AIC(fit) computes the Akaike information criterion (AIC).
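
The fitting functions and generics above can be tried on simulated data with a known relationship. In the sketch below, the seed, sample size, and true coefficients are arbitrary, and the helper sse is a hypothetical function written only to show how optim() can solve the same least-squares problem.

```r
# Simulated data with a known linear relationship (arbitrary parameters)
set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Fit a linear model and inspect it with the generic functions
fit <- lm(y ~ x, data = dat)
coef(fit)                 # estimated intercept and slope
df.residual(fit)          # 50 observations - 2 coefficients = 48
head(residuals(fit), 3)
AIC(fit)

# Predictions for new inputs
predict(fit, newdata = data.frame(x = c(2, 4)))

# The same least-squares problem solved with general-purpose optim()
sse <- function(par, data) sum((data$y - par[1] - par[2] * data$x)^2)
opt <- optim(par = c(0, 0), fn = sse, data = dat, method = "BFGS")
opt$par                   # should be close to coef(fit)
```

For linear models lm() is both faster and more accurate; the optim() call is included only to show the general pattern of passing initial values and an objective function.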


2.20  Statistics 

There  are  many  R  packages  and  functions  for  computing  a  wide  spectrum  of 
statistics.  Below  are  some  commonly  used  examples,  and  we  will  see  many  more 
throughout: 

aov(formula)  analysis  of  variance  model. 

anova(fit,  ...)  analysis  of  variance  (or  deviance)  tables  for  one  or  more  fitted 
model  objects. 

density(x)  kernel  density  estimates  of  x. 

Other functions include: binom.test(), pairwise.t.test(), power.t.test(),
prop.test(), t.test(), ...; use help.search("test") to
see details.
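
A few of these functions are sketched below on datasets that ship with R (sleep and mtcars). The choice of variables and models is ours and is meant only to show the calling conventions.

```r
# Two-sample t-test on the built-in sleep data (extra hours of sleep by drug)
tt <- t.test(extra ~ group, data = sleep)
tt$p.value

# One-way analysis of variance: mpg across cylinder groups in mtcars
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit_aov)

# anova() compares nested linear models
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
cmp <- anova(m1, m2)
cmp

# Kernel density estimate; the result can be passed to plot()
dens <- density(mtcars$mpg)
plot(dens, main = "Kernel density of mpg")

# Exact binomial test: 7 successes in 20 trials against p = 0.5
bt <- binom.test(7, 20, p = 0.5)
bt$p.value
```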


2.21 Distributions

It's easy to generate random samples from different distributions. Remember to
include set.seed() if you want to get reproducibility during exercises
(Table 2.9).

Also, all of these functions can be used by replacing the letter r with d, p or q to
get, respectively, the probability density (dfunc(x, ...)), the cumulative
probability (pfunc(x, ...)), and the value of the quantile (qfunc(p, ...),
with 0 < p < 1).
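
The r/d/p/q naming convention can be checked directly on a couple of familiar distributions. The seed and the evaluation points in this sketch are arbitrary.

```r
# Reproducible random sample: the same seed yields the same draws
set.seed(1234)
s1 <- rnorm(5, mean = 0, sd = 1)
set.seed(1234)
s2 <- rnorm(5, mean = 0, sd = 1)
identical(s1, s2)        # TRUE

# Density, cumulative probability, and quantile for the normal family
dnorm(0)                 # density at 0, equal to 1 / sqrt(2 * pi)
pnorm(1.96)              # P(Z <= 1.96), approximately 0.975
qnorm(0.975)             # approximately 1.96

# p and q functions are inverses of each other
qexp(pexp(2, rate = 3), rate = 3)   # recovers 2

# Discrete example: P(X = 3) and P(X <= 3) for X ~ Binomial(10, 0.5)
dbinom(3, size = 10, prob = 0.5)    # 120 / 1024
pbinom(3, size = 10, prob = 0.5)    # 176 / 1024
```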


2.21.1  Programming 

The  standard  setting  for  defining  new  functions  is: 

function.name <- function(x) { expr(an expression); return(value) },

where x is the parameter in the expression. A simple example of this is:

adding <- function(x=0, y=0) {z <- x+y
return(z)}
adding(x=5, y=10)

## [1] 15

Table 2.9 Examples of R random number generators

Expression                              Explanation
rnorm(n, mean = 0, sd = 1)              Gaussian (normal)
rexp(n, rate = 1)                       Exponential
rgamma(n, shape, scale = 1)             Gamma
rpois(n, lambda)                        Poisson
rweibull(n, shape, scale = 1)           Weibull
rcauchy(n, location = 0, scale = 1)     Cauchy
rbeta(n, shape1, shape2)                Beta
rt(n, df)                               Student's (t)
rf(n, df1, df2)                         Fisher's (F) (df1, df2)
rchisq(n, df)                           Pearson (chi-squared)
rbinom(n, size, prob)                   Binomial
rgeom(n, prob)                          Geometric
rhyper(nn, m, n, k)                     Hypergeometric
rlogis(n, location = 0, scale = 1)      Logistic
rlnorm(n, meanlog = 0, sdlog = 1)       Lognormal
rnbinom(n, size, prob)                  Negative binomial
runif(n, min = 0, max = 1)              Uniform
rwilcox(nn, m, n), rsignrank(nn, n)     Wilcoxon's statistics

Conditions setting

if(cond) {expr}

or

if(cond) cons.expr else alt.expr

x <- 10
if(x>10) z="T" else z="F"
z

## [1] "F"

Alternatively, ifelse represents a vectorized and extremely efficient conditional
mechanism that provides one of the main advantages of R.
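
To illustrate the vectorized form, the sketch below applies the same condition to a whole vector at once and compares it with an explicit loop. The input vector is invented, and it includes a missing value to show how NA is propagated.

```r
x <- c(3, 10, 12, NA, 25)

# Vectorized conditional: one result per element, NA propagates
z <- ifelse(x > 10, "T", "F")
z   # "F" "F" "T" NA "T"

# Equivalent explicit loop, shown only for comparison
z_loop <- character(length(x))
for (i in seq_along(x)) {
  if (is.na(x[i])) {
    z_loop[i] <- NA
  } else if (x[i] > 10) {
    z_loop[i] <- "T"
  } else {
    z_loop[i] <- "F"
  }
}
identical(z, z_loop)   # TRUE
```

Unlike ifelse(), a plain if() requires a single non-missing logical value, which is why the loop has to test for NA explicitly.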

For  loop 

for(var  in  seq)  expr 
x<-c() 

for(i  in  1:10)  x[i]=i 
x 

##  [1]  1  2  3  4  5  6  7  8  9  10 

Other  loops 

While loop: while(cond) expr.

Repeat: repeat expr.

Applied to innermost of nested loops: break, next.

Use  braces  { }  around  statements. 

ifelse(test,  yes,  no)  a  value  with  the  same  shape  as  test  filled  with  elements  from 
either  yes  or  no. 

do.call(funname,  args)  executes  a  function  call  from  the  name  of  the  function 
and  a  list  of  arguments  to  be  passed  to  it. 
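
The remaining control structures and do.call() can be exercised with a few lines of code. The numbers in the sketch below are arbitrary and serve only to make the results easy to verify by hand.

```r
# while: halve a value until it drops below 1
n <- 100
steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}
steps   # 7

# repeat with next and break: sum the odd numbers below 10
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i >= 10) break        # leave the loop
  if (i %% 2 == 0) next     # skip even numbers
  total <- total + i
}
total   # 1 + 3 + 5 + 7 + 9 = 25

# do.call: call a function by name with a list of arguments
arg_list <- list(1:10, na.rm = TRUE)
avg <- do.call("mean", arg_list)
avg     # 5.5

# A common idiom: row-bind a list of data frames
pieces <- list(data.frame(a = 1), data.frame(a = 2), data.frame(a = 3))
combined <- do.call(rbind, pieces)
```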


2.22  Data  Simulation  Primer 

Before we demonstrate how to synthetically simulate data that closely resemble the
characteristics of real observations from the same process, we start by importing some
observed data for initial exploratory analytics.

Using the SOCR Health Evaluation and Linkage to Primary (HELP) Care Dataset
we can extract some sample data: 00_Tiny_SOCR_HELP_Data_Simmulation.csv
(Table 2.10, Figs. 2.8 and 2.9).


Table 2.10 A fragment of the SOCR Health Evaluation and Linkage to Primary (HELP) Care
dataset

ID   i2  age  treat  homeless  pcs  mcs  cesd  ...  female  Substance  racegrp
1    0   25   0      0         49   7    46    ...  0       Cocaine    Black
2
3    39  36   0      0         76   9    33    ...  0       Heroin     Black
100  81  22   0      0         37   17   19    ...  0       Alcohol    Other

Fig. 2.8 Histogram of a sample of 200 random Normal(m = 10, sd = 20) observations

Fig. 2.9 Comparing the histogram (black) of the real ages of the PD patients to
the synthetically generated/simulated ages (blue)

data_1 <- read.csv("https://umich.instructure.com/files/1628625/download?download_frd=1", as.is=T, header=T)
# data_1 = read.csv(file.choose())
attach(data_1)
# to ensure all variables are accessible within R, e.g., using "age" instead of data_1$age

# i2 maximum number of drinks (standard units) consumed per day (in the past 30 days range 0-184) see also i1
# treat randomization group (0=usual care, 1=HELP clinic)
# pcs SF-36 Physical Component Score (range 14-75)
# mcs SF-36 Mental Component Score (range 7-62)
# cesd Center for Epidemiologic Studies Depression scale (range 0-60)
# indtot Inventory of Drug Use Consequences (InDUC) total score (range 4-45)
# pss_fr perceived social supports (friends, range 0-14) see also dayslink
# drugrisk Risk-Assessment Battery (RAB) drug risk score (range 0-21)
# satreat any BSAS substance abuse treatment at baseline (0=no, 1=yes)


summary(data_1)

##        ID               i2              age            treat
##  Min.   :  1.00   Min.   :  0.00   Min.   : 3.00   Min.   :0.0000
##  1st Qu.: 24.25   1st Qu.:  1.00   1st Qu.:27.00   1st Qu.:0.0000
##  Median : 50.50   Median : 15.50   Median :34.00   Median :0.0000
##  Mean   : 50.29   Mean   : 27.08   Mean   :34.31   Mean   :0.1222
##  3rd Qu.: 74.75   3rd Qu.: 39.00   3rd Qu.:43.00   3rd Qu.:0.0000
##  Max.   :100.00   Max.   :137.00   Max.   :65.00   Max.   :2.0000
##     homeless           pcs             mcs             cesd
##  Min.   :0.0000   Min.   : 6.00   Min.   : 0.00   Min.   : 0.00
##  1st Qu.:0.0000   1st Qu.:41.25   1st Qu.:20.25   1st Qu.:17.25
##  Median :0.0000   Median :48.50   Median :29.00   Median :30.00
##  Mean   :0.1444   Mean   :47.61   Mean   :30.49   Mean   :30.21
##  3rd Qu.:0.0000   3rd Qu.:57.00   3rd Qu.:39.75   3rd Qu.:43.00
##  Max.   :1.0000   Max.   :76.00   Max.   :93.00   Max.   :68.00
##      indtot          pss_fr          drugrisk         sexrisk
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:31.25   1st Qu.: 2.000   1st Qu.: 0.000   1st Qu.: 1.250
##  Median :36.00   Median : 6.000   Median : 0.000   Median : 5.000
##  Mean   :37.03   Mean   : 6.533   Mean   : 2.578   Mean   : 4.922
##  3rd Qu.:45.00   3rd Qu.:10.000   3rd Qu.: 3.000   3rd Qu.: 7.750
##  Max.   :60.00   Max.   :20.000   Max.   :23.000   Max.   :13.000
##     satreat            female         substance           racegrp
##  Min.   :0.00000   Min.   :0.00000   Length:90          Length:90
##  1st Qu.:0.00000   1st Qu.:0.00000   Class :character   Class :character
##  Median :0.00000   Median :0.00000   Mode  :character   Mode  :character
##  Mean   :0.07778   Mean   :0.05556
##  3rd Qu.:0.00000   3rd Qu.:0.00000
##  Max.   :1.00000   Max.   :1.00000

x.norm <- rnorm(n=200, mean=10, sd=20)
hist(x.norm, main='N(10, 20) Histogram')

mean(data_1$age)
## [1] 34.31111

sd(data_1$age)
## [1] 11.68947

Next, we will simulate new synthetic data to match the properties/characteristics
of the observed data (using Uniform, Normal, and Poisson distributions):

# i2 [0: 184]
# age m=34, sd=12
# treat {0, 1}
# homeless {0, 1}
# pcs 14-75
# mcs 7-62
# cesd 0-60
# indtot 4-45
# pss_fr 0-14
# drugrisk 0-21
# sexrisk
# satreat (0=no, 1=yes)
# female (0=no, 1=yes)
# racegrp (black, white, other)

# Demographics variables
# Define number of subjects
NumSubj <- 282
NumTime <- 4

#  Define  data  elements 

#  Cases 


Cases <- c(2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28,
    29, 30, 31, 32, 33, 34, 35, 37, 41, 42, 43, 44, 45, 53, 55, 58, 60, 62, 67, 69,
    71, 72, 74, 79, 80, 85, 87, 90, 95, 97, 99, 100, 101, 106, 107, 109, 112, 120,
    123, 125, 128, 129, 132, 134, 136, 139, 142, 147, 149, 153, 158, 160, 162, 163,
    167, 172, 174, 178, 179, 180, 182, 192, 195, 201, 208, 211, 215, 217, 223, 227,
    228, 233, 235, 236, 240, 245, 248, 250, 251, 254, 257, 259, 261, 264, 268, 269,
    272, 273, 275, 279, 288, 289, 291, 296, 298, 303, 305, 309, 314, 318, 324, 325,
    326, 328, 331, 332, 333, 334, 336, 338, 339, 341, 344, 346, 347, 350, 353, 354,
    359, 361, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378,
    381, 382, 384, 385, 386, 387, 389, 390, 393, 395, 398, 400, 410, 421, 423, 428,
    433, 435, 443, 447, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461,
    465, 466, 467, 470, 471, 472, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486,
    487, 488, 489, 492, 493, 494, 496, 498, 501, 504, 507, 510, 513, 515, 528, 530,
    533, 537, 538, 542, 545, 546, 549, 555, 557, 559, 560, 566, 572, 573, 576, 582,
    586, 590, 592, 597, 603, 604, 611, 619, 621, 623, 624, 625, 631, 633, 634, 635,
    637, 640, 641, 643, 644, 645, 646, 647, 648, 649, 650, 652, 654, 656, 658, 660,
    664, 665, 670, 673, 677, 678, 679, 680, 682, 683, 686, 687, 688, 689, 690, 692)

# Imaging Biomarkers
L_caudate_ComputeArea <- rpois(NumSubj, 600)
L_caudate_Volume <- rpois(NumSubj, 800)
R_caudate_ComputeArea <- rpois(NumSubj, 893)
R_caudate_Volume <- rpois(NumSubj, 1000)
L_putamen_ComputeArea <- rpois(NumSubj, 900)
L_putamen_Volume <- rpois(NumSubj, 1400)
R_putamen_ComputeArea <- rpois(NumSubj, 1300)
R_putamen_Volume <- rpois(NumSubj, 3000)
L_hippocampus_ComputeArea <- rpois(NumSubj, 1300)
L_hippocampus_Volume <- rpois(NumSubj, 3200)
R_hippocampus_ComputeArea <- rpois(NumSubj, 1500)
R_hippocampus_Volume <- rpois(NumSubj, 3800)
cerebellum_ComputeArea <- rpois(NumSubj, 16700)
cerebellum_Volume <- rpois(NumSubj, 14000)
L_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
L_lingual_gyrus_Volume <- rpois(NumSubj, 11000)
R_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_lingual_gyrus_Volume <- rpois(NumSubj, 12000)
L_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3600)
L_fusiform_gyrus_Volume <- rpois(NumSubj, 11000)
R_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_fusiform_gyrus_Volume <- rpois(NumSubj, 10000)

Sex <- ifelse(runif(NumSubj)<.5, 0, 1)
Weight <- as.integer(rnorm(NumSubj, 80, 10))
Age <- as.integer(rnorm(NumSubj, 62, 10))

# Diagnostic labels (Dx):
Dx <- c(rep("PD", 100), rep("HC", 100), rep("SWEDD", 82))

# Genetics traits
chr12_rs34637584_GT <- c(ifelse(runif(100)<.3, 0, 1), ifelse(runif(100)<.6, 0, 1),
    ifelse(runif(82)<.4, 0, 1))  # NumSubj Bernoulli trials
chr17_rs11868035_GT <- c(ifelse(runif(100)<.7, 0, 1), ifelse(runif(100)<.4, 0, 1),
    ifelse(runif(82)<.5, 0, 1))  # NumSubj Bernoulli trials

# Clinical  # rpois(NumSubj, 15) + rpois(NumSubj, 6)
UPDRS_part_I <- c(ifelse(runif(100)<.7, 0, 1) + ifelse(runif(100)<.7, 0, 1),
    ifelse(runif(100)<.6, 0, 1) + ifelse(runif(100)<.6, 0, 1),
    ifelse(runif(82)<.4, 0, 1) + ifelse(runif(82)<.4, 0, 1))
UPDRS_part_II <- c(sample.int(20, 100, replace=T), sample.int(14, 100, replace=T),
    sample.int(18, 82, replace=T))
UPDRS_part_III <- c(sample.int(30, 100, replace=T), sample.int(20, 100, replace=T),
    sample.int(25, 82, replace=T))

#  Time:  VisitTime  -  done  automatically  below  in  aggregator 

#  Data  (putting  all  components  together) 
sim_PD_Data <- cbind(
    rep(Cases, each= NumTime),                   # Cases
    rep(L_caudate_ComputeArea, each= NumTime),   # Imaging
    rep(Sex, each= NumTime),                     # Demographics
    rep(Weight, each= NumTime),
    rep(Age, each= NumTime),
    rep(Dx, each= NumTime),                      # Dx
    rep(chr12_rs34637584_GT, each= NumTime),     # Genetics
    rep(chr17_rs11868035_GT, each= NumTime),
    rep(UPDRS_part_I, each= NumTime),            # Clinical
    rep(UPDRS_part_II, each= NumTime),
    rep(UPDRS_part_III, each= NumTime),
    rep(c(0, 6, 12, 18), NumSubj)                # Time
)

# Assign the column names
colnames(sim_PD_Data) <- c(
    "Cases",
    "L_caudate_ComputeArea",
    "Sex", "Weight", "Age",
    "Dx", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
    "UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III",
    "Time"
)

#  some  QC 

summary(sim_PD_Data)

##      Cases       L_caudate_ComputeArea  Sex         Weight         Age
##  10     :   4    594    : 36            0:592    77     : 72   59     : 68
##  100    :   4    607    : 36            1:536    83     : 60   58     : 56
##  101    :   4    618    : 32                     76     : 56   61     : 56
##  106    :   4    581    : 28                     78     : 56   60     : 52
##  107    :   4    585    : 28                     70     : 44   67     : 52
##  109    :   4    599    : 28                     75     : 44   65     : 48
##  (Other):1104    (Other):940                     (Other):796   (Other):796
##      Dx       chr12_rs34637584_GT  chr17_rs11868035_GT  UPDRS_part_I
##  HC   :400    0:464                0:592                0:412
##  PD   :400    1:664                1:536                1:520
##  SWEDD:328                                              2:196
##
##
##
##  UPDRS_part_II   UPDRS_part_III   Time
##  13     :108     1      : 80      0 :282
##  9      :100     13     : 68      12:282
##  5      : 88     19     : 60      18:282
##  14     : 76     7      : 60      6 :282
##  6      : 76     12     : 56
##  3      : 72     6      : 56
##  (Other):608     (Other):748

dim(sim_PD_Data)
## [1] 1128   12
head(sim_PD_Data)

##      Cases L_caudate_ComputeArea Sex Weight Age  Dx   chr12_rs34637584_GT
## [1,] "2"   "618"                 "0" "75"   "59" "PD" "1"
## [2,] "2"   "618"                 "0" "75"   "59" "PD" "1"
## [3,] "2"   "618"                 "0" "75"   "59" "PD" "1"
## [4,] "2"   "618"                 "0" "75"   "59" "PD" "1"
## [5,] "3"   "621"                 "0" "61"   "44" "PD" "1"
## [6,] "3"   "621"                 "0" "61"   "44" "PD" "1"
##      chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## [1,] "0"                 [?]          "2"           "4"            "0"
## [2,] "0"                 [?]          "2"           "4"            "6"
## [3,] "0"                 [?]          "2"           "4"            "12"
## [4,] "0"                 [?]          "2"           "4"            "18"
## [5,] [?]                 [?]          "13"          "6"            "0"
## [6,] [?]                 [?]          "13"          "6"            "6"

hist(data_1$age, freq=FALSE, right=FALSE, ylim = c(0, 0.05))
lines(density(as.numeric(as.data.frame(sim_PD_Data)$Age)), lwd=2, col="blue")
legend("topright", c("Raw Data", "Simulated Data"), fill=c("black", "blue"))

# Save Results
# Write out (save) the result to a file that can be shared
write.table(sim_PD_Data, "output_data.csv", sep=", ", row.names=FALSE, col.names=TRUE)
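
Because cbind() combines numeric vectors with the character vector Dx, the simulated object is a character matrix, which is why every value printed by head() appears in quotes. The following is only a minimal sketch, assuming a hypothetical three-subject cohort (the names n_subj, n_time, sim_df, and sim_mat are illustrative and not part of the original protocol), of how data.frame() could be used instead so that each column keeps its own type:

```r
# Hypothetical miniature version of the simulation (3 subjects, 4 visits each)
set.seed(1234)
n_subj <- 3
n_time <- 4
cases  <- c(2, 3, 6)
age    <- as.integer(rnorm(n_subj, 62, 10))
dx     <- c("PD", "HC", "SWEDD")

# data.frame() stores each column with its own type,
# whereas cbind() on mixed vectors coerces everything to character
sim_df <- data.frame(
  Cases = rep(cases, each = n_time),
  Age   = rep(age,   each = n_time),
  Dx    = rep(dx,    each = n_time),
  Time  = rep(c(0, 6, 12, 18), n_subj),
  stringsAsFactors = FALSE
)
str(sim_df)            # Cases, Age, and Time stay numeric; Dx is character

# The matrix route, for comparison
sim_mat <- cbind(rep(cases, each = n_time), rep(dx, each = n_time))
class(sim_mat[, 1])    # the numeric case IDs have become character strings
```

With a data frame, calls such as mean(sim_df$Age) work directly, without the as.numeric(as.data.frame(...)) conversion used in the density overlay above.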


2.23  Appendix 

2.23.1  HTML  SOCR  Data  Import 


SOCR  Datasets  can  automatically  be  downloaded  into  the  R  environment  using  the 
following  protocol,  which  uses  the  Parkinson's  Disease  dataset  as  an  example: 

library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top

pd_data <- html_table(html_nodes(wiki_url, "table")[[1]])
head(pd_data); summary(pd_data)

##   Cases L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
## 1     2                   597              767                   855
## 2     2                   597              767                   855
## 3     2                   597              767                   855
## 4     2                   597              767                   855
## 5     3                   604              873                   935
## 6     3                   604              873                   935
##   chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## 1                   0            1            12              1    0
## 2                   0            1            12              1    6
## 3                   0            1            12              1   12
## 4                   0            1            12              1   18
## 5                   1            0            19             22    0
## 6                   1            0            19             22    6
##      Cases       L_caudate_ComputeArea L_caudate_Volume
##  Min.   :  2.0   Min.   :525.0         Min.   :719.0
##  1st Qu.:158.0   1st Qu.:582.0         1st Qu.:784.0
##  Median :363.5   Median :600.0         Median :800.0
##  Mean   :346.1   Mean   :600.4         Mean   :800.3
##  3rd Qu.:504.0   3rd Qu.:619.0         3rd Qu.:819.0
##  Max.   :692.0   Max.   :667.0         Max.   :890.0

Also  see:  http://wiki.socr.umich.edu/index.php/SMHS_DataSimulation. 
2.23.2  R Debugging

Most programs that give incorrect results are impacted by logical errors. When errors
(bugs, exceptions) occur, we need to explore deeper. This procedure of identifying and
fixing bugs is called “debugging”.

Common R tools for debugging include traceback(), debug(), browser(), trace(),
and recover().

traceback(): A failing R function reports the error to the screen immediately.
Calling traceback() afterwards prints the list of functions that were called before
the error occurred, which shows the function where the error occurred.

The function calls are printed in reverse order.

f1<-function(x) { r<- x-g1(x); r }
g1<-function(y) { r<-y*h1(y); r }
h1<-function(z) { r<-log(z); if(r<10) r^2 else r^3}

f1(-1)

## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

traceback()

3: h(y)
## Error in eval(expr, envir, enclos): could not find function "h"

2: g(x)
## Error in eval(expr, envir, enclos): could not find function "g"

1: f(-1)
## Error in eval(expr, envir, enclos): could not find function "f"

debug()

traceback() does not tell you where the error is. To find out which line causes
the error, we may step through the function using debug().

debug(foo) flags the function foo() for debugging. undebug(foo)
unflags the function.

When a function is flagged for debugging, each statement in the function is
executed one at a time. After a statement is executed, the function suspends and
the user can interact with the R shell.

This allows us to inspect a function line-by-line.

An example computing the sum of squared errors, SS:
## compute sum of squares
SS<-function(mu, x) {
    d<-x-mu;
    d2<-d^2;
    ss<-sum(d2);
    ss }

set.seed(100);
x<-rnorm(100);
SS(1, x)

## to debug
debug(SS); SS(1, x)

## debugging in: SS(1, x)
## debug at <text>#2: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }
## debug at <text>#3: d <- x - mu
## debug at <text>#4: d2 <- d^2
## debug at <text>#5: ss <- sum(d2)
## debug at <text>#6: ss
## exiting from: SS(1, x)
## [1] 202.5614519

In the debugging shell ("Browse[1]> "), users can:

•  Enter n (next) to execute the current line and print the next one;

•  Enter c (continue) to execute the rest of the function without stopping;

•  Enter Q to quit the debugging;

•  Enter ls() to list all objects in the local environment;

•  Enter an object name or print() to display the current value of an object.

Example: 

debug(SS)
SS(1, x)

## debugging in: SS(1, x)
## debug at <text>#2: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }
## debug at <text>#3: d <- x - mu
## debug at <text>#4: d2 <- d^2
## debug at <text>#5: ss <- sum(d2)
## debug at <text>#6: ss
## exiting from: SS(1, x)
## [1] 202.5614519

Browse[1]> n
debug: d <- x - mu  ## the next command
Browse[1]> ls()  ## current environment
[1] "mu" "x"  ## there is no d
Browse[1]> n  ## go one step
debug: d2 <- d^2  ## the next command
Browse[1]> ls()  ## current environment
[1] "d" "mu" "x"  ## d has been created
Browse[1]> d[1:3]  ## first three elements of d
[1] -1.5021924 -0.8684688 -1.0789171
Browse[1]> hist(d)  ## histogram of d
Browse[1]> where  ## current position in call stack
where 1: SS(1, x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q  ## quit

undebug(SS)  ## remove debug label, stop debugging process
SS(1, x)  ## calling SS again now runs without debugging
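
To check whether a function is currently flagged, base R provides isdebugged(). The sketch below redefines a compact SS() so that it is self-contained; the variable names flag_before, flag_during, and flag_after are illustrative. The function is deliberately not called while flagged, so the example also runs in a non-interactive session:

```r
# Compact, self-contained version of the sum-of-squares function
SS <- function(mu, x) {
  d  <- x - mu
  d2 <- d^2
  sum(d2)
}

flag_before <- isdebugged(SS)   # not flagged yet
debug(SS)                       # flag SS for line-by-line execution
flag_during <- isdebugged(SS)   # flagged now
undebug(SS)                     # remove the flag again
flag_after  <- isdebugged(SS)   # no longer flagged

print(c(before = flag_before, during = flag_during, after = flag_after))
```

When only a single stepped-through call is needed, debugonce(SS) is a base-R alternative that removes the flag automatically after the next call.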


You  can  label  a  function  for  debugging  while  debugging  another  function. 

f<-function(x)
{ r<-x-g(x);
  r }

g<-function(y)
{ r<-y*h(y);
  r }

h<-function(z)
{ r<-log(z);
  if(r<10)
    r^2
  else
    r^3 }

debug(f)  ## If you only debug f, you will not go into g
f(-1)

## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

Browse[1]> n
Browse[1]> n

But we can also label g and h for debugging while we debug f.

f(-1)
Browse[1]> n
Browse[1]> debug(g)
Browse[1]> debug(h)
Browse[1]> n


Inserting  a  call  to  browser()  in  a  function  will  pause  the  execution  of  a  function  at 
the  point  where  browser()  is  called.  This  is  similar  to  using  debug(),  except  you  can 
control  where  execution  gets  paused. 
Example

h<-function(z) {
    browser()  ## a break point inserted here
    r<-log(z);
    if(r<10)
        r^2
    else
        r^3
}

f(-1)

## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

Browse[1]> ls()
Browse[1]> z
Browse[1]> n
Browse[1]> n
Browse[1]> ls()
Browse[1]> c

Calling trace() on a function allows inserting new code into a function. The
syntax for trace() may be challenging.

as.list(body(h))
trace("h", quote(
    if(is.nan(r))
        {browser()}), at=3, print=FALSE)
f(1)
f(-1)

trace("h", quote(if(z<0) {z <- 1}), at=2, print=FALSE)
f(-1)
untrace("h")

During the debugging process, recover() allows checking the status of variables
in upper level functions. recover() can be used as an error handler via options()
(e.g., options(error = recover)). When functions throw exceptions, execution stops at
the point of failure. Browsing the function calls and examining the environment may
indicate the source of the problem.
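
recover() is interactive, so it is of limited use in batch scripts. As a hedged, non-interactive sketch (not the book's method), the base-R functions tryCatch(), withCallingHandlers(), and sys.calls() can record the call stack at the point of failure. The names captured_calls, result, and called_functions are illustrative:

```r
# The same chain of functions used in the debugging examples
f <- function(x) { r <- x - g(x); r }
g <- function(y) { r <- y * h(y); r }
h <- function(z) { r <- log(z); if (r < 10) r^2 else r^3 }

captured_calls <- NULL
result <- tryCatch(
  withCallingHandlers(
    suppressWarnings(f(-1)),           # log(-1) gives NaN; the warning is muted
    error = function(e) {
      # Runs while the failing frames are still on the stack
      captured_calls <<- sys.calls()
    }
  ),
  error = function(e) conditionMessage(e)   # return the message instead of stopping
)

# Name of the function invoked by each recorded call
called_functions <- vapply(captured_calls,
                           function(cl) deparse(cl[[1]])[1],
                           character(1))
print(result)
print(called_functions)
```

The recorded calls include f, g, and h in calling order, which is the same information traceback() reports after an interactive failure.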


2.24  Assignments:  2.  R  Foundations 

2.24.1  Confirm  that  You  Have  Installed  R/RStudio 

You  should  be  able  to  download  and  load  the  Foundations  of  R  code  in  RStudio  and 
then  run  all  the  examples. 
2.24.2  Long-to-Wide Data Format Translation

Load the SOCR Parkinson’s Disease data in the long format
(http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata#Data_Table)
and export it in the wide format. You may restrict the export to five variables of
your choice (not all), but note that there are several time observations for each
subject. You can try using the reshape command.


2.24.3  Data Frames

Create  a  Data  Frame  of  the  SOCR  Parkinson’s  Disease  data  and  compute  a  summary 
of  three  features  you  select. 


2.24.4  Data  Stratification 

Using  the  same  SOCR  Parkinson’s  Disease  data: 

•  Extract  the  first  10  subjects 

•  Find the cases for which L_caudate_ComputeArea < 600

•  Sort the subjects based on L_caudate_Volume

•  Generate frequency and probability tables for Gender and Age

•  Compute the mean Age and the correlation between Age and Weight

•  Plot a histogram and density of R_fusiform_gyrus_Volume and a scatterplot of
L_fusiform_gyrus_Volume and R_fusiform_gyrus_Volume.

Note:  You  don’t  have  to  apply  these  data  filters  sequentially,  but  this  can  also  be 
done  for  deeper  stratification. 


2.24.5  Simulation 

Generate  1,000  standard  normal  variables  and  1,200  Cauchy  distributed  random 
variables  and  generate  a  quantile-quantile  (Q-Q)  probability  plot  of  the  two  samples. 
Repeat  this  with  1,500  student  t  distributed  random  variables  with  df  =  20  and 
generate  a  quantile-quantile  (Q-Q)  probability  plot. 
2.24.6  Programming

Generate a function that computes the arithmetic average and compare it against the
mean() function using the simulation data you generated in the last question.
Chapter 3

Managing Data in R


In  this  Chapter,  we  will  discuss  strategies  to  import  data  and  export  results. 
Also,  we  are  going  to  learn  the  basic  tricks  we  need  to  know  about  processing 
different  types  of  data.  Specifically,  we  will  illustrate  common  R  data  structures  and 
strategies  for  loading  (ingesting)  and  saving  (regurgitating)  data.  In  addition,  we  will 
(1)  present  some  basic  statistics,  e.g.,  for  measuring  central  tendency  (mean,  median, 
mode) or dispersion (variance, quartiles, range); (2) explore simple plots; (3) demonstrate
the uniform and normal distributions; (4) contrast numerical and categorical
types  of  variables;  (5)  present  strategies  for  handling  incomplete  (missing)  data;  and 
(6)  show  the  need  for  cohort-rebalancing  when  comparing  imbalanced  groups  of 
subjects,  cases  or  units. 


3.1  Saving  and  Loading  R  Data  Structures 

Let’s start by extracting Edgar Anderson’s Iris Data from the package
datasets. The iris dataset quantifies morphologic shape variations of 50 Iris
flowers from each of three related species: Iris setosa, Iris virginica, and Iris
versicolor. Four shape features were measured from each sample: the length and the
width of the sepals and petals (in centimeters). These data were used by Ronald Fisher
in his 1936 linear discriminant analysis paper (Fig. 3.1).

data()
data(iris)
class(iris)

## [1] "data.frame"


©  Ivo  D.  Dinov  2018 

I.  D.  Dinov,  Data  Science  and  Predictive  Analytics , 
https://doi.org/10.1007/978-3-319-72347-1_3
Fig. 3.1  Definitions of petal width and length for the three iris flower species
(Iris virginica, Iris versicolor, Iris setosa) used in the example below


As  an  I/O  (input/output)  demonstration,  after  we  load  the  iris  data  and  examine 
its  class  type,  we  can  save  it  into  a  file  named  "myData.RData"  and  then  reload  it 
back  into  R. 

save(iris, file="myData.RData")
load("myData.RData")
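
The round trip can be checked explicitly. The sketch below is only an illustration: it writes to a temporary directory instead of the working directory, and the names my_iris, tmp_file, reference, and restored are assumptions introduced here:

```r
my_iris  <- iris                                   # working copy of the built-in data
tmp_file <- file.path(tempdir(), "myData.RData")   # temporary location for the demo

save(my_iris, file = tmp_file)   # the object is stored under its own name
reference <- my_iris             # keep a copy to compare against
rm(my_iris)                      # remove the object from the workspace

restored <- load(tmp_file)       # load() returns the names of the restored objects
print(restored)
print(identical(my_iris, reference))   # TRUE when the round trip is lossless
```

Note that load() recreates objects under the names they had when saved, silently overwriting any existing objects with the same names.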


3.2  Importing  and  Saving  Data  from  CSV  Files 


We import the data from "CaseStudy07_WorldDrinkingWater_Data.csv", available
from these case-studies (https://umich.instructure.com/courses/38100/files/folder/
Case_Studies), and save it into the R dataset named “water”. The variables in the
dataset are as follows:

•  Time:  Years  (1990,  1995,  2000,  2005,  2010,  2012) 

•  Demographic:  Country  (across  the  world) 

•  Residence  Area  Type:  Urban,  rural,  or  total 

•  WHO  Region 

•  Population  using  improved  drinking  water  sources:  The  percentage  of  the 
population  using  an  improved  drinking  water  source. 

•  Population using improved sanitation facilities: The percentage of the population
using an improved sanitation facility.

Generally, the separator of a CSV file is a comma. By default, the read.csv()
command uses the option sep = ",". Also, we can use colnames() to
rename the column variables.
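
A small offline sketch of these options, using an inline string in place of a downloaded file (the three-column layout is a hypothetical excerpt, and csv_text, csv2_text, d1, and d2 are illustrative names). It also shows why imported headers such as "Year (string)" turn into names like Year..string.: by default read.csv() converts headers into syntactically valid R names.

```r
# Inline CSV text standing in for a file (hypothetical excerpt)
csv_text <- "Year (string),Country (string),Improved water (numeric)
1990,Algeria,88
1990,Angola,42"

d1 <- read.csv(text = csv_text)          # default separator: sep = ","
print(names(d1))                         # spaces and parentheses become dots

# The same data with a semicolon separator needs an explicit sep argument
csv2_text <- gsub(",", ";", csv_text, fixed = TRUE)
d2 <- read.csv(text = csv2_text, sep = ";")

# Replace the mangled names with short, readable ones
colnames(d2) <- c("year", "country", "improved_water")
print(d2)
```

Renaming immediately after import keeps later code short, e.g., d2$improved_water instead of the long auto-generated name.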


water <- read.csv('https://umich.instructure.com/files/399172/download?download_frd=1', header=T)
water[1:3, ]

##   Year..string. WHO.region..string. Country..string.
## 1          1990              Africa          Algeria
## 2          1990              Africa           Angola
## 3          1990              Africa            Benin
##   Residence.Area.Type..string.
## 1                        Rural
## 2                        Rural
## 3                        Rural
##   Population.using.improved.drinking.water.sources..numeric.
## 1                                                         88
## 2                                                         42
## 3                                                         49
##   Population.using.improved.sanitation.facilities..numeric.
## 1                                                        77
## 2                                                         7
## 3                                                         0

colnames(water)<-c("year", "region", "country", "residence_area", "improved_water", "sanitation_facilities")
water[1:3, ]

##   year region country residence_area improved_water sanitation_facilities
## 1 1990 Africa Algeria          Rural             88                    77
## 2 1990 Africa  Angola          Rural             42                     7
## 3 1990 Africa   Benin          Rural             49                     0

which.max(water$year);

## [1] 913

# rowMeans(water[,5:6])
mean(water[,6], trim=0.08, na.rm=T)

## [1] 71.63629

This  code  loads  CSV  files  that  already  include  a  header  line  listing  the  names  of 
the  variables.  If  we  don’t  have  a  header  in  the  dataset,  we  can  use  the 
header  =  FALSE  option  (https://umich.instructure.com/courses/38100/files/ 
folder/Case_Studies).  R  will  assign  default  names  to  the  column  variables  of  the 
dataset. 

Simulation <- read.csv("https://umich.instructure.com/files/354289/download?download_frd=1", header = FALSE)
Simulation[1:3, ]

##   V1 V2  V3    V4       V5  V6  V7   V8     V9    V10      V11     V12
## 1 ID i2 age treat homeless pcs mcs cesd indtot pss_fr drugrisk sexrisk
## 2  1  0  25     0        0  49   7   46     37      0        1       6
## 3  2 18  31     0        0  48  34   17     48      0        0      11
##       V13    V14       V15     V16
## 1 satreat female substance racegrp
## 2       0      0   cocaine   black
## 3       0      0   alcohol   white
To save a data frame to a CSV file, we can use the write.csv() function.
The option file = "a/local/file/path" lets us specify the path of the saved file.

write.csv(iris, file = "C:/Users/iris.csv")  # Iris data
write.csv(water, file = "C:/Users/WHO_Water.csv")  # World Drinking Water
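
The hard-coded C:/Users/ paths above only exist on some systems. A portable sketch, assuming a temporary directory is acceptable for the demonstration (out_file and iris_back are illustrative names), writes the iris data and reads it back to confirm that nothing was lost:

```r
out_file <- file.path(tempdir(), "iris_copy.csv")   # platform-independent path

# row.names = FALSE avoids writing an extra, unnamed column of row labels
write.csv(iris, file = out_file, row.names = FALSE)

# Read the file back; stringsAsFactors = TRUE restores Species as a factor
iris_back <- read.csv(out_file, stringsAsFactors = TRUE)

print(dim(iris_back))
print(names(iris_back))
print(all.equal(iris$Sepal.Length, iris_back$Sepal.Length))
```

Omitting row.names = FALSE would add a leading column of row labels to the file, which read.csv() then imports as an additional variable.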


3.3  Exploring  the  Structure  of  Data 


We can use the command str() to explore the structure of a dataset. For instance,
using the World Drinking Water dataset:

str(water)

## 'data.frame':    3331 obs. of  6 variables:
##  $ year                 : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ region               : Factor w/ 6 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country              : Factor w/ 192 levels "Afghanistan",..: 3 5 19 23 26 27 30 32 33 37 ...
##  $ residence_area       : Factor w/ 3 levels "Rural","Total",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ improved_water       : num  88 42 49 86 39 67 34 46 37 83 ...
##  $ sanitation_facilities: num  77 7 0 22 2 42 27 12 4 11 ...

We can see that this (World Drinking Water) dataset has 3331 observations
and 6 variables. The output also gives us the class of each variable and the first
few elements in the variable.
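
The same structural facts can be retrieved programmatically, which is convenient in scripts. This sketch uses the built-in iris data so that it runs without a network connection; col_classes and numeric_cols are illustrative names:

```r
# Number of observations (rows) and variables (columns)
print(dim(iris))

# Class of every column, as a named character vector
col_classes <- sapply(iris, class)
print(col_classes)

# Names of the numeric columns only
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
print(numeric_cols)
```

Selecting columns by type this way makes it easy to apply numeric summaries only where they make sense, e.g., summary(iris[numeric_cols]).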


3.4  Exploring  Numeric  Variables 


Summary statistics for numeric variables in the dataset can be accessed by using
the command summary() (Fig. 3.2).

Fig. 3.2  Density plot of the water improvement variable in the World Health
Organization (WHO) water quality case-study (plot title:
density.default(x = water$improved_water, na.rm = T); N = 3299, Bandwidth = 2.923)
summary(water$year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1990    1995    2005    2002    2010    2012

summary(water[c("improved_water", "sanitation_facilities")])

##  improved_water   sanitation_facilities
##  Min.   :  3.0    Min.   :  0.00
##  1st Qu.: 77.0    1st Qu.: 42.00
##  Median : 93.0    Median : 81.00
##  Mean   : 84.9    Mean   : 68.87
##  3rd Qu.: 99.0    3rd Qu.: 97.00
##  Max.   :100.0    Max.   :100.00
##  NA's   :32       NA's   :135

plot(density(water$improved_water, na.rm = T))
# variables need not be continuous, we can still get intuition about their distribution

The six summary statistics and NA’s (missing data) are reported in the R output
above.


3.5  Measuring  the  Central  Tendency:  Mean,  Median,  Mode 

Mean and median are two frequently used measures of central tendency. The mean is
“the sum of all values divided by the number of values”. The median is the number in the
middle of an ordered list of values. In R, the mean() and median() functions provide
these two measurements.

vec1<-c(40, 56, 99)
mean(vec1)

## [1] 65

mean(c(40, 56, 99))

## [1] 65

median(vec1)

## [1] 56

median(c(40, 56, 99))

## [1] 56

# install.packages("psych");
library("psych")
geometric.mean(vec1, na.rm=TRUE)

## [1] 60.52866
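
For reference, the three measures can be reproduced directly from their definitions in base R, without the psych package. This is a minimal sketch; manual_mean, manual_median, and manual_geo are illustrative names, and the median shortcut below assumes an odd number of values:

```r
vec1 <- c(40, 56, 99)

# Arithmetic mean: sum of the values divided by their count
manual_mean <- sum(vec1) / length(vec1)

# Median: middle element of the sorted values (odd length assumed here)
sorted        <- sort(vec1)
manual_median <- sorted[(length(sorted) + 1) / 2]

# Geometric mean: exponential of the average logarithm (positive values only)
manual_geo <- exp(mean(log(vec1)))

print(c(mean = manual_mean, median = manual_median, geometric = manual_geo))
```

The values agree with the mean(), median(), and geometric.mean() results reported above.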
The mode is the value that occurs most often in the dataset. It is often used for
categorical data, where the mean and median are inappropriate measurements.

We can have one or more modes. In the water dataset, we have “Europe” and
“Urban” as the modes for region and residence area, respectively. These two variables
are unimodal, i.e., each has a single mode. For the year variable, we have two
modes: 2000 and 2005. Both of these categories have 570 counts. The year variable is
an example of a bimodal variable. We can also have multimodal data with two or more
modes.

The mode is just one of the measures of central tendency. The best way to use it is
to compare the count of the mode to the counts of the other values. This helps us judge
whether one or several categories dominate all others in the data. After that, we are
able to analyze the meaning behind these common centrality measures.

In  numeric  datasets,  the  mode(s)  represents  the  highest  bin(s)  in  the  histogram. 
In  this  way,  we  can  also  examine  if  the  numeric  data  is  multimodal. 

More  information  about  measures  of  centrality  is  available  here  (http://wiki.socr. 
umich.edu/index.php/AP_Statistics_Curriculum_2007_EDA_Center). 
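
Base R has no built-in function for the statistical mode (the mode() function reports an object's storage type). One possible sketch, using table() to count occurrences, is shown below; find_modes() is a hypothetical helper and the input vectors are toy data, not the water dataset:

```r
# Return every value that attains the highest frequency (as character labels)
find_modes <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # all values tied for the maximum
}

residence <- c("Urban", "Rural", "Urban", "Total")
years     <- c(1990, 2000, 2000, 2005, 2005)

single_mode <- find_modes(residence)   # one mode: unimodal
two_modes   <- find_modes(years)       # two modes: bimodal

print(single_mode)
print(two_modes)
```

Inspecting table(x) itself, rather than only the modes, supports the comparison of counts recommended above.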


3.6  Measuring  Spread:  Quartiles  and  the  Five-Number 
Summary 

The five-number summary describes the spread of a dataset. Its components are:

•  Minimum (Min.), representing the smallest value in the data

•  First quartile/Q1 (1st Qu.), representing the 25th percentile, which splits off the
lowest 25% of data from the highest 75%

•  Median/Q2 (Median), representing the 50th percentile, which splits off the
lowest 50% of data from the top 50%

•  Third quartile/Q3 (3rd Qu.), representing the 75th percentile, which splits off
the lowest 75% of data from the top 25%

•  Maximum (Max.), representing the largest value in the data.

Min and Max can be obtained by using min() and max(), respectively.

The difference between the maximum and minimum is known as the range. In R, the
range() function gives us both the minimum and maximum. Combining
range() and diff() yields the actual range value.

range(water$year)

## [1] 1990 2012

diff(range(water$year))

## [1] 22

Q1 and Q3 are the 25th and 75th percentiles of the data. The median (Q2) lies
between Q1 and Q3. The difference between Q3 and Q1 is called the
interquartile range (IQR). The IQR spans the middle half of the data, which contains
no extreme values.

In R, we use IQR() to calculate the interquartile range. If the data contain NA’s,
we need the option na.rm = TRUE so that the function ignores them.

IQR(water$year)

## [1] 15

summary(water$improved_water)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     3.0    77.0    93.0    84.9    99.0   100.0      32

IQR(water$improved_water, na.rm = T)

## [1] 22

Similar to the command summary() that we talked about earlier in this Chapter,
the function quantile() can be used to obtain the five-number summary.

quantile(water$improved_water, na.rm = T)

##   0%  25%  50%  75% 100%
##    3   77   93   99  100

We can also calculate specific percentiles in the data. For example, if we want the
20th and 60th percentiles, we can do the following.

quantile(water$improved_water, probs = c(0.2, 0.6), na.rm = T)

## 20% 60%
##  71  97

Using the seq() function, we can generate percentiles that are evenly-spaced.

quantile(water$improved_water, seq(from=0, to=1, by=0.2), na.rm = T)

##   0%  20%  40%  60%  80% 100%
##    3   71   89   97  100  100

Let’s reexamine the five-number summary for the improved water variable.
When we ignore the NA’s, the difference between the minimum and Q1 is 74, while the
difference between Q3 and the maximum is only 1. The interquartile range is 22%.
Combining these facts, the first quarter is more widely spread than the middle 50% of
values. The last quarter is the most condensed one, as it spans only two percentages:
99% and 100%. Also, we can notice that the mean is smaller than the median. The
mean is more sensitive to extreme values than the median. The very small
minimum makes the range of the first quarter very large, and this extreme value
impacts the mean more than the median.
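
The relationships among the quartiles, the IQR, and missing values can be verified on a toy vector. The numbers below are hypothetical, chosen only so that the arithmetic is easy to follow, and iqr_manual, iqr_builtin, and na_fails are illustrative names:

```r
x <- c(3, 77, 93, 99, 100, NA)   # toy data with one missing value

# Quartiles computed after dropping the NA
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)

iqr_manual  <- unname(q[2] - q[1])   # Q3 - Q1 by hand
iqr_builtin <- IQR(x, na.rm = TRUE)  # the same quantity from IQR()

# Without na.rm = TRUE, IQR() stops with an error when NA's are present
na_fails <- inherits(try(IQR(x), silent = TRUE), "try-error")

print(c(manual = iqr_manual, builtin = iqr_builtin))
print(na_fails)
print(c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE)))
```

In this toy vector the single small value pulls the mean well below the median, the same pattern described for the improved water variable.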




3.7  Visualizing  Numeric  Variables:  Boxplots 

We can visualize the five-number summary with a boxplot (box-and-whiskers plot). With the boxplot() function we can specify the title (main = " ") and the labels for the x (xlab = " ") and y (ylab = " ") axes (Fig. 3.3).

boxplot(water$improved_water, main="Boxplot for Percent improved_water",
ylab="Percentage")

In the boxplot, we have five horizontal lines. Each represents the corresponding value in the five-number summary (when outliers are present, a whisker ends at the most extreme value that is not flagged as an outlier). The box in the middle represents the middle 50% of values. The bold line in the box is the median. The mean is not shown on the graph.

The whiskers of a boxplot extend at most 1.5 times the IQR below Q1 and above Q3. Any value that falls outside this range is drawn as a circle or dot and is considered a potential outlier. We can see that there are many candidate outliers with small values at the low end of the graph.
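The outlier rule can be reproduced by hand. The following is a minimal sketch on a simulated, left-skewed sample (hypothetical data, not the WHO dataset). It relies on the fact that R's boxplot() builds its box from the hinges returned by fivenum(), which may differ slightly from the quartiles returned by quantile().

```r
set.seed(1)
# Hypothetical left-skewed sample on a 0-100 scale (not the WHO data)
x <- round(100 * rbeta(500, 5, 1))

h <- fivenum(x)            # min, lower hinge, median, upper hinge, max
box_width <- h[4] - h[2]   # hinge spread, used by boxplot() as the IQR
lower_fence <- h[2] - 1.5 * box_width
upper_fence <- h[4] + 1.5 * box_width

# Points beyond the fences are the ones drawn individually
manual_out <- x[x < lower_fence | x > upper_fence]
bp_out <- boxplot.stats(x)$out   # outliers as identified by R itself

cat("fences:", lower_fence, upper_fence, "\n")
cat("number of candidate outliers:", length(manual_out), "\n")
```

The manually flagged points coincide with those reported by boxplot.stats(), the helper that boxplot() uses internally.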


Fig. 3.3 Boxplot of the water improvement variable in the WHO dataset


3.8 Visualizing Numeric Variables: Histograms

A histogram is another way to show the spread of a numeric variable (see Chap. 4 for additional information). It divides the range of the data into a predetermined number of bins that act as containers for the values. The height of each bin indicates the frequency of the values falling into it (Figs. 3.4 and 3.5).

hist(water$improved_water, main = "Histogram of Percent improved_water", xlab="Percentage")


hist(water$sanitation_facilities, main = "Histogram of Percent sanitation_facilities", xlab = "Percentage")

We can see that the shapes of the two graphs are somewhat similar. Both are left skewed (mean < median). Other common skew patterns are shown in Fig. 3.6.


Fig. 3.4 Histogram plot of the water improvement data

Fig. 3.5 Histogram plot of overall proportion of regions with sanitation facilities (WHO water dataset)

Fig. 3.6 Shape differences between skewed and symmetric distributions




Fig.  3.7  Live  visualization  demonstrations  using  SOCR  and  Distributome  resources 


You can see the density plots of over 80 different probability distributions using the SOCR Java Distribution Calculators (http://socr.umich.edu/html/dist/) or the Distributome HTML5 Distribution Calculators (http://www.distributome.org/tools.html), Fig. 3.7.


3.9  Understanding  Numeric  Data:  Uniform 
and  Normal  Distributions 

If the data follow a uniform distribution, then all values are equally likely to occur in any interval of a fixed width. The histogram for a uniformly distributed dataset would have approximately equal heights for each bin, see Fig. 3.9.
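As a minimal sketch of this property, the code below draws a simulated uniform sample and counts the values falling into ten equal-width bins. The sample size, the interval [0, 1], and the seed are assumptions chosen only for illustration.

```r
set.seed(2024)
n <- 10000                       # assumed sample size
u <- runif(n, min = 0, max = 1)  # simulated uniform sample on [0, 1]

# Ten equal-width bins; for uniform data each should hold about n/10 values
h <- hist(u, breaks = seq(0, 1, by = 0.1), plot = FALSE)
print(h$counts)

# Histogram on the density scale, overlaid with the true density (height 1)
hist(u, breaks = seq(0, 1, by = 0.1), probability = TRUE,
     col = "lightblue", main = "Uniform Distribution", xlab = "x")
abline(h = 1, col = "red", lwd = 3)
```

The bin counts fluctuate around n/10 because of sampling variability, which is why the bars are only approximately level.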


N <- 1000   # sample size; assumed here because N is not defined in this excerpt
x <- rnorm(N, 0, 1)
hist(x, probability=T,
col='lightblue', xlab=' ', ylab=' ', axes=F,
main='Normal Distribution')
lines(density(x, bw=0.4), col='red', lwd=3)

Often, but not always, real-world processes are approximately normally distributed. A normal distribution has a higher frequency for middle values and a lower frequency for more extreme values. It has a symmetric, bell-shaped curve, as in Fig. 3.8. Many parametric statistical approaches assume normality of the data. In cases where this parametric assumption is violated, variable transformations or distribution-free tests may be more appropriate.

Fig. 3.8 Overlay of a Normal distribution density (red) and a corresponding Normal sample histogram (blue)

Fig. 3.9 Uniform distribution density and sample histogram plot


3.10  Measuring  Spread:  Variance  and  Standard  Deviation 


A distribution is a great way to characterize data using only a few parameters. For example, a normal distribution can be defined by only two parameters: center and spread, or statistically by the mean and standard deviation.

One way to estimate the mean is to divide the sum of the data values by the total number of values. So, we have the following formula:

$$Mean(X) = \mu = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The variance is the average of the squared deviations from the mean, and the standard deviation is the square root of the variance:

$$Var(X) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2,$$

$$StdDev(X) = \sigma = \sqrt{Var(X)}.$$
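The definitions can be checked numerically. This is a minimal sketch on a hypothetical toy vector; it assumes the sample version of the variance with the n - 1 denominator, which is what R's var() and sd() compute.

```r
# Hypothetical toy sample (any numeric vector would do)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

mu_hat  <- sum(x) / n                      # sample mean
var_hat <- sum((x - mu_hat)^2) / (n - 1)   # sample variance (n - 1 denominator)
sd_hat  <- sqrt(var_hat)                   # sample standard deviation

# The hand computations agree with R's built-in functions
print(c(mean = mu_hat, var = var_hat, sd = sd_hat))
print(c(mean = mean(x), var = var(x), sd = sd(x)))
```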

Since the water dataset is non-normal, we use a new dataset, containing the demographics of baseball players, to illustrate normal distribution properties. The "01_data.txt" dataset has the following variables:

• Name

• Team

• Position

• Height

• Weight

• Age

Fig. 3.10 Histogram plot of the players’ weights, Major League Baseball (MLB) dataset

Fig. 3.11 Histogram plot of the players’ heights, MLB data

We  can  check  the  histogram  for  approximate  normality  of  the  players’  weight  and 
height  (Figs.  3.10  and  3.11). 

baseball<-read.table("https://umich.instructure.com/files/330381/download?download_frd=1", header=T)

hist(baseball$Weight, main = "Histogram for Baseball Player's Weight", xlab="weight")


hist(baseball$Height, main = "Histogram for Baseball Player's Height", xlab="height")

These  plots  allow  us  to  visually  inspect  the  normality  of  the  players’  heights  and 
weights.  We  could  also  obtain  mean  and  standard  deviation  of  the  weight  and  height 
variables. 
mean(baseball$Weight)

## [1] 201.7166

mean(baseball$Height)

## [1] 73.69729

var(baseball$Weight)

## [1] 440.9913

sd(baseball$Weight)

## [1] 20.99979

var(baseball$Height)

## [1] 5.316798

sd(baseball$Height)

## [1] 2.305818

A larger standard deviation, or variance, suggests that the data are more spread out from the mean. Therefore, the weight variable is more spread out than the height variable (each measured in its own units).

Given the first two moments (mean and standard deviation), we can easily estimate how extreme a specific value is. Assuming we have a normal distribution, the values follow a 68-95-99.7 rule. This means that approximately 68% of the data lies within the interval $[\mu - \sigma, \mu + \sigma]$, 95% of the data lies within the interval $[\mu - 2\sigma, \mu + 2\sigma]$, and 99.7% of the data lies within the interval $[\mu - 3\sigma, \mu + 3\sigma]$. The following graph, plotted in R, illustrates the 68-95-99.7% rule (Fig. 3.12).

Applying the 68-95-99.7 rule to our baseball weight variable, we expect that about 68% of our players weigh between 180.7 pounds and 222.7 pounds, about 95% of the players weigh between 159.7 pounds and 243.7 pounds, and about 99.7% of the players weigh between 138.7 pounds and 264.7 pounds.
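These intervals follow directly from the mean and standard deviation reported above. As a small sketch, the code below recomputes them and uses pnorm() to obtain the exact normal probabilities behind the rounded 68, 95, and 99.7 percentages.

```r
mu    <- 201.7166   # mean weight reported above
sigma <- 20.99979   # standard deviation of weight reported above

k <- 1:3
intervals <- data.frame(
  k     = k,
  lower = round(mu - k * sigma, 1),
  upper = round(mu + k * sigma, 1),
  # Exact normal probability of falling within k standard deviations
  normal_prob = round(pnorm(k) - pnorm(-k), 4)
)
print(intervals)
```

The rule is exact only for a normal distribution; for the observed weights it is an approximation whose quality depends on how close the data are to normal.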


Fig. 3.12 68-95-99.7% rule for Normal distribution


3.11 Exploring Categorical Variables

Back to our water dataset, we can treat the year variable as a categorical rather than a numeric variable. Since the year variable only has six distinct values, it is reasonable to treat it as a categorical feature, where each value is a category that could apply to multiple WHO regions. Moreover, the region and residence area variables are also categorical.

Unlike numeric variables, categorical variables are better examined by tables than by summary statistics. A one-way table represents a single categorical variable and gives us the counts of the different categories. The table() function can create one-way tables for our water dataset:


water <- read.csv('https://umich.instructure.com/files/399172/download?download_frd=1', header=T)
table(water$Year)

##
## 1990 1995 2000 2005 2010 2012
##  520  561  570  570  556  554

table(water$WHO.region)

##
##                Africa              Americas Eastern Mediterranean
##                   797                   613                   373
##                Europe       South-East Asia       Western Pacific
##                   910                   191                   447

table(water$Residence.Area)

##
## Rural Total Urban
##  1095  1109  1127

Given that we have a total of 3331 observations, the WHO region table tells us that about 27% (910/3331) of the areas examined in the study are in Europe.

R can directly give us table proportions when using the prop.table() function. The proportion values can be transformed into percentages.

year_table<-table(water$Year..string.)
prop.table(year_table)

##
##      1990      1995      2000      2005      2010      2012
## 0.1561093 0.1684179 0.1711198 0.1711198 0.1669168 0.1663164

year_pct<-prop.table(year_table)*100
round(year_pct, digits=1)

##
## 1990 1995 2000 2005 2010 2012
## 15.6 16.8 17.1 17.1 16.7 16.6

3.12 Exploring Relationships Between Variables

So far, the methods and statistics that we have seen are univariate. Sometimes, we want to examine the relationship between two or more variables. For example, did the percentage of the population that uses improved drinking-water sources increase over time? To address such problems, we need to look at bivariate or multivariate relationships.

Visualizing  Relationships:  scatter  plots 

Let’s  look  at  a  bivariate  case  first.  A  scatterplot  is  a  good  way  to  visualize  bivariate 
relationships.  We  have  the  x-axis  and  y-axis  each  representing  one  of  the  variables. 
Each  observation  is  illustrated  on  the  graph  by  a  dot.  If  the  graph  shows  a  clear 
pattern,  rather  than  a  cluster  of  random  dots,  the  two  variables  may  be  correlated  with 
each  other. 

In R, we can use the plot() function to create scatterplots. We have to define the variables for the x and y-axes. The labels in the graph are editable (Fig. 3.13).

plot.window(c(400,1000), c(500,1000))
plot(x=water$year, y=water$improved_water,
main= "Scatterplot of Year vs. Improved_water",
xlab= "Year",
ylab= "Percent of Population Using Improved Water")

We  can  see  from  the  scatterplot  that  there  appears  to  be  a  pattern. 
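A visual impression can be complemented with a numerical summary. The following is a minimal sketch on simulated data; the variable names, the upward trend, and the noise level are all hypothetical and only mimic the layout of six survey years. It draws the scatterplot and then computes the Pearson correlation and the average percentage within each year.

```r
set.seed(10)
# Simulated data: six survey years, with percentages that drift upward
year <- rep(c(1990, 1995, 2000, 2005, 2010, 2012), each = 50)
pct  <- pmin(100, 60 + 0.8 * (year - 1990) + rnorm(length(year), sd = 10))

plot(x = year, y = pct, main = "Simulated percentages by year",
     xlab = "Year", ylab = "Percent")

r <- cor(year, pct)                     # Pearson correlation
group_means <- tapply(pct, year, mean)  # average percentage within each year
print(round(r, 2))
print(round(group_means, 1))
```

When one variable takes only a few distinct values, as the year does here, the per-group means are often easier to read than the raw cloud of points.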

Examining  Relationships:  two-way  cross-tabulations 

A scatterplot is a useful tool for examining the relationship between two variables where at least one of them is numeric. When both variables are nominal, a two-way cross-tabulation (also called a crosstab or contingency table) would be a better choice.

Fig. 3.13 Scatterplot of the percent of world population using improved water quality (WHO dataset)

The function CrossTable() is available in R in the gmodels package. Let’s install it first.


#install.packages("gmodels", repos = "http://cran.us.r-project.org")
library(gmodels)

We are interested in investigating the relationship between World Health Organization (WHO) region and residence area type in the water study. We might want to know if there is a difference in terms of residence area type between the African WHO region and all other WHO regions.

To address this problem, we need to create an indicator variable for the African WHO region first.


water$africa<-water$WHO.region=="Africa"


Let’s revisit the table() function to see how many observations belong to the African WHO region.


table(water$africa)
## FALSE  TRUE
##  2534   797


Now, let’s create a two-way cross-tabulation using CrossTable().


CrossTable(x=water$Residence.Area, y=water$africa)


##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  3331
##
##                      | water$africa
## water$Residence.Area |     FALSE |      TRUE | Row Total |
## ---------------------|-----------|-----------|-----------|
##                Rural |       828 |       267 |      1095 |
##                      |     0.030 |     0.096 |           |
##                      |     0.756 |     0.244 |     0.329 |
##                      |     0.327 |     0.335 |           |
##                      |     0.249 |     0.080 |           |
## ---------------------|-----------|-----------|-----------|
##                Total |       845 |       264 |      1109 |
##                      |     0.002 |     0.007 |           |
##                      |     0.762 |     0.238 |     0.333 |
##                      |     0.333 |     0.331 |           |
##                      |     0.254 |     0.079 |           |
## ---------------------|-----------|-----------|-----------|
##                Urban |       861 |       266 |      1127 |
##                      |     0.016 |     0.050 |           |
##                      |     0.764 |     0.236 |     0.338 |
##                      |     0.340 |     0.334 |           |
##                      |     0.258 |     0.080 |           |
## ---------------------|-----------|-----------|-----------|
##         Column Total |      2534 |       797 |      3331 |
##                      |     0.761 |     0.239 |           |
## ---------------------|-----------|-----------|-----------|

Each cell in the table contains five numbers. The first one, N, gives the count of observations that fall into the corresponding category. The Chi-square contribution gives the cell’s contribution to the statistic of Pearson’s Chi-squared test for independence between the two variables, i.e., the squared difference between the observed and expected counts divided by the expected count. It is not itself a probability; larger contributions flag the cells that deviate most from what independence would predict.

The numbers of interest are the N / Col Total and N / Row Total proportions. In this case, the N / Col Total values represent the distributions of residence area type among the African regions and among the regions in the rest of the world. We can see that these proportions are very close between African and non-African regions for each type of residence area (e.g., 0.335 vs. 0.327 for rural areas). Therefore, we find no evidence that the African WHO region differs from the rest of the world in terms of residence area types.
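The Chi-square contributions in the table can be reproduced from the counts alone. This is a minimal sketch in base R (it does not use gmodels): the counts are copied from the cross-tabulation above, and chisq.test() is used to obtain the overall test statistic and p-value.

```r
# Counts copied from the cross-tabulation above
counts <- matrix(c(828, 267,
                   845, 264,
                   861, 266),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(residence = c("Rural", "Total", "Urban"),
                                 africa    = c("FALSE", "TRUE")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Each cell's contribution to the Pearson statistic
contrib <- (counts - expected)^2 / expected
print(round(contrib, 3))

# The contributions add up to the test statistic; the p-value comes from the test
test <- chisq.test(counts)
print(test)
```

The sum of the six contributions equals the X-squared statistic, and the large p-value is consistent with the conclusion that residence area type and African region membership are not associated in these data.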


3.13  Missing  Data 

In the previous sections, we simply ignored the incomplete observations in our water dataset (na.rm = TRUE). Is this an appropriate strategy to handle incomplete data? Could the missingness pattern of those incomplete observations be important? It is possible that the arrangement of the missing observations reflects an important factor that was not accounted for in our statistics or our models.

Missing Completely at Random (MCAR) is an assumption that the probability of missingness is equal for all cases; Missing at Random (MAR) assumes that the probability of missingness depends only on observed information (e.g., different rates for different groups); Missing not at Random (MNAR) suggests a missingness mechanism linked to values of the predictors and/or response that are themselves unobserved, e.g., some participants may drop out of a drug trial when they have side-effects.
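The three mechanisms can be contrasted with a short simulation. This is a minimal sketch under stated assumptions: the data are simulated, the variable names (group, y) and all missingness rates are hypothetical, and the MNAR mechanism is an arbitrary logistic function of the value itself.

```r
set.seed(42)
n <- 10000
group <- rbinom(n, 1, 0.5)           # fully observed covariate (hypothetical)
y     <- rnorm(n, mean = 0, sd = 1)  # variable that will receive missing values

# MCAR: every case has the same 20% chance of being missing
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: the chance of missingness depends only on the observed group
p_mar <- ifelse(group == 1, 0.35, 0.05)
y_mar <- ifelse(runif(n) < p_mar, NA, y)

# MNAR: the chance of missingness depends on the (unobserved) value itself
y_mnar <- ifelse(runif(n) < plogis(2 * y), NA, y)

# Observed means: only MNAR is clearly biased here, because y is unrelated to group
print(round(c(full = mean(y), MCAR = mean(y_mcar, na.rm = TRUE),
              MAR = mean(y_mar, na.rm = TRUE), MNAR = mean(y_mnar, na.rm = TRUE)), 3))
# Missingness rates by group under MAR
print(round(tapply(is.na(y_mar), group, mean), 3))
```

In this particular setup dropping the incomplete cases is harmless under MCAR and MAR but distorts the mean under MNAR; if y were related to group, the MAR case would also require adjustment.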

There are a number of strategies to impute missing data. The expectation maximization (EM) algorithm provides one example for handling missing data. The SOCR EM tutorial, activity, and documentation provide the theory, applications, and practice for effective (multidimensional) EM parameter estimation.

Fig. 3.14 Schematic data representation indexing data values by case (rows) and feature (columns)

The simplest way to handle incomplete data is to substitute each missing value with its (feature or column) average. When the missingness proportion is small, substituting the means for the missing values will have little effect on the mean, variance, or other important statistics of the data. Also, this will preserve the non-missing values of the same observation or row, see Fig. 3.14.

# Reading through an abbreviated name works because `$` partially matches column names
m1<-mean(water$Population.using.improved.drinking, na.rm = T)
m2<-mean(water$Population.using.improved.sanitation, na.rm = T)
water_imp<-water
for(i in 1:3331){
if(is.na(water_imp$Population.using.improved.drinking[i])){
# Assigning through the abbreviated name creates a new column (see the summary below)
water_imp$Population.using.improved.drinking[i]=m1
}
if(is.na(water_imp$Population.using.improved.sanitation[i])){
# No [i] index here, so the whole new column is set to m2 (see the summary below)
water_imp$Population.using.improved.sanitation=m2
}
}
summary(water_imp)


##  Year..string.   WHO.region..string.            Country..string
##  Min.   :1990   Africa               :797   Albania            :  18
##  1st Qu.:1995   Americas             :613   Algeria            :  18
##  Median :2005   Eastern Mediterranean:373   Andorra            :  18
##  Mean   :2002   Europe               :910   Angola             :  18
##  3rd Qu.:2010   South-East Asia      :191   Antigua and Barbuda:  18
##  Max.   :2012   Western Pacific      :447   Argentina          :  18
##                                             (Other)            :3223
##  Residence.Area.Type..string.
##  Rural:1095
##  Total:1109
##  Urban:1127
##
##  Population.using.improved.drinking.water.sources.numeric.
##  Min.   :  3.0
##  1st Qu.: 77.0
##  Median : 93.0
##  Mean   : 84.9
##  3rd Qu.: 99.0
##  Max.   :100.0
##  NA's   :32
##  Population.using.improved.sanitation.facilities.numeric.
##  Min.   :  0.00
##  1st Qu.: 42.00
##  Median : 81.00
##  Mean   : 68.87
##  3rd Qu.: 97.00
##  Max.   :100.00
##  NA's   :135
##
##    africa          Population.using.improved.sanitation
##  Mode :logical    Min.   :68.87
##  FALSE:2534       1st Qu.:68.87
##  TRUE :797        Median :68.87
##  NA's :0          Mean   :68.87
##                   3rd Qu.:68.87
##                   Max.   :68.87
##
##  Population.using.improved.drinking
##  Min.   :  3.0
##  1st Qu.: 77.0
##  Median : 93.0
##  Mean   : 84.9
##  3rd Qu.: 99.0
##  Max.   :100.0
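Mean substitution can also be written without an explicit loop over rows. The following is a minimal sketch on a hypothetical toy data frame (the column names drinking and sanitation are made up for illustration). Because assignment through `$` does not partially match column names, the sketch indexes each column by its full name with `[[ ]]` and replaces only the missing entries.

```r
# Hypothetical toy data frame with missing values in two numeric columns
df <- data.frame(drinking   = c(88, NA, 49, 86, NA, 67),
                 sanitation = c(77, 7, NA, 22, 2, 42))

df_imp <- df
for (col in c("drinking", "sanitation")) {
  col_mean <- mean(df[[col]], na.rm = TRUE)  # mean of the observed values
  missing  <- is.na(df_imp[[col]])           # logical index of missing entries
  df_imp[[col]][missing] <- col_mean         # replace only the missing entries
}
print(df_imp)
```

The observed values are left untouched and each column mean is unchanged by the substitution, although the column variances shrink because the imputed entries all sit exactly at the mean.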

A more sophisticated way of resolving missing data is to use a model (e.g., linear regression) to predict the missing feature and impute its missing values. Predictive mean matching (PMM) follows this idea: each missing value is replaced by an observed value borrowed from a case whose model-predicted mean is close to that of the incomplete case. This method works well for data with multivariate normality. However, a disadvantage is that it imputes one variable at a time, which can be very time consuming. Also, the multivariate normality assumption might not be satisfied and there may be important multivariate relations that are not accounted for. We use the mi package to demonstrate predictive mean matching.
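Before turning to the mi package, the matching idea can be sketched in base R. This is a deliberately simplified illustration on simulated data, not the algorithm implemented in mi: it fits one linear model, uses a single nearest donor for each incomplete case, and omits the random draws of model parameters and donors that full implementations use. All variable names are hypothetical.

```r
set.seed(7)
n <- 200
x <- rnorm(n)                 # fully observed predictor (simulated)
y <- 2 + 3 * x + rnorm(n)     # outcome that will receive missing values
y_obs <- y
y_obs[sample(n, 40)] <- NA    # remove 40 values completely at random

miss <- is.na(y_obs)
fit  <- lm(y_obs ~ x, subset = !miss)              # model fitted to complete cases
pred <- predict(fit, newdata = data.frame(x = x))  # predicted means for all cases

# For each missing case, borrow the observed y from the complete case
# whose predicted mean is closest (single nearest donor)
donor_pool <- which(!miss)
y_imp <- y_obs
for (i in which(miss)) {
  donor <- donor_pool[which.min(abs(pred[donor_pool] - pred[i]))]
  y_imp[i] <- y_obs[donor]
}
print(summary(y_imp))
```

Because every imputed value is an actually observed value, the imputations always stay within the range of the data, which is one practical appeal of matching over plugging in the regression prediction itself.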

Let’s install the mi package first.


# install.packages("mi", repos = "http://cran.us.r-project.org")
library(mi)

Then, we need to get the missing information matrix. We are using the imputation method pmm (predictive mean matching approach) for both missing variables.


mdf<-missing_data.frame(water)
head(mdf)


##   Year..string. WHO.region..string. Country..string.
## 1          1990              Africa          Algeria
## 2          1990              Africa           Angola
## 3          1990              Africa            Benin
## 4          1990              Africa         Botswana
## 5          1990              Africa     Burkina Faso
## 6          1990              Africa          Burundi
##   Residence.Area.Type..string.
## 1                        Rural
## 2                        Rural
## 3                        Rural
## 4                        Rural
## 5                        Rural
## 6                        Rural
##   Population.using.improved.drinking.water.sources.numeric.
## 1                                                        88
## 2                                                        42
## 3                                                        49
## 4                                                        86
## 5                                                        39
## 6                                                        67
##   Population.using.improved.sanitation.facilities.numeric. africa
## 1                                                       77   TRUE
## 2                                                        7   TRUE
## 3                                                        0   TRUE
## 4                                                       22   TRUE
## 5                                                        2   TRUE
## 6                                                       42   TRUE
##   missing_Population.using.improved.drinking.water.sources.numeric.
## 1                                                             FALSE
## 2                                                             FALSE
## 3                                                             FALSE
## 4                                                             FALSE
## 5                                                             FALSE
## 6                                                             FALSE
##   missing_Population.using.improved.sanitation.facilities.numeric.
## 1                                                            FALSE
## 2                                                            FALSE
## 3                                                            FALSE
## 4                                                            FALSE
## 5                                                            FALSE
## 6                                                            FALSE


show(mdf)


## Object of class missing_data.frame with 3331 observations on 7 variables
##
## There are 3 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
##                                                                            type
## Year..string.                                                        continuous
## WHO.region..string.                                       unordered-categorical
## Country..string.                                          unordered-categorical
## Residence.Area.Type..string.                              unordered-categorical
## Population.using.improved.drinking.water.sources.numeric.            continuous
## Population.using.improved.sanitation.facilities.numeric.             continuous
## africa                                                                   binary
##                                                           missing
## Year..string.                                                   0
## WHO.region..string.                                             0
## Country..string.                                                0
## Residence.Area.Type..string.                                    0
## Population.using.improved.drinking.water.sources.numeric.      32
## Population.using.improved.sanitation.facilities.numeric.      135
## africa                                                          0
##                                                           method
## Year..string.                                               <NA>
## africa                                                      <NA>

mdf<-change(mdf, y="Population.using.improved.drinking", what = "imputation_method", to="pmm")

mdf<-change(mdf, y="Population.using.improved.sanitation", what = "imputation_method", to="pmm")


Notes

• Converting the input data.frame to a missing_data.frame allows us to include enhanced metadata about each variable in the data frame, which is essential for the subsequent modeling, interpretation, and imputation of the initial missing data.

• show() displays all missing variables and their class labels (e.g., continuous), along with metadata. The missing_data.frame constructor suggests the most appropriate classes for each missing variable; however, the user often needs to correct, modify, or change these metadata using change().

• Use the change() function to change or correct metadata in the constructed missing_data.frame object that may be incorrectly reported by show(mdf).

• To get a sense of the raw data, look at the summary, image, or hist of the missing_data.frame.

• The mi vignettes provide many useful examples of handling missing data.

Next, we can perform the initial imputation. Here we use three chains (n.chains=3), which will create three different datasets with slightly different imputed values.

imputations<-mi(mdf, n.iter=10, n.chains=3, verbose=T)

Next, we need to extract several multiply imputed data.frames from the imputations object. Finally, we can compare the summary statistics between the original dataset and the imputed datasets.


data.frames <- complete(imputations, 3)
summary(water)

##  Year..string.   WHO.region..string.            Country..string
##  Min.   :1990   Africa               :797   Albania            :  18
##  1st Qu.:1995   Americas             :613   Algeria            :  18
##  Median :2005   Eastern Mediterranean:373   Andorra            :  18
##  Mean   :2002   Europe               :910   Angola             :  18
##  3rd Qu.:2010   South-East Asia      :191   Antigua and Barbuda:  18
##  Max.   :2012   Western Pacific      :447   Argentina          :  18
##                                             (Other)            :3223
##  Residence.Area.Type..string.
##  Rural:1095
##  Total:1109
##  Urban:1127
##  missing_Population.using.improved.sanitation.facilities.numeric.
##  Mode :logical
##  FALSE:3196
##  TRUE :135
##  NA's :0

This is just a brief introduction to handling incomplete datasets. In later chapters, we will discuss missing data in more depth, including different imputation methods and how to evaluate the completed (imputed) results.

3.13.1 Simulate Some Real Multivariate Data

Suppose we would like to generate a synthetic dataset:

sim_data = {y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}.

Then, we can introduce a method that takes a dataset and a desired proportion of missingness and wipes out that proportion of the data, i.e., introduces random patterns of missingness. Note that there are already R functions that automate the introduction of missingness, e.g., missForest::prodNA(); however, writing such a method from scratch is also useful. Figure 3.15 shows the results of introducing 30% missingness in the simulated data.


Fig. 3.15 Incomplete data image plot illustrating the pattern of data missingness

set.seed(123)
# create MCAR missing-data generator
create.missing <- function (data, pct.mis = 10)
{
  n <- nrow(data)
  J <- ncol(data)
  if (length(pct.mis) == 1) {
    if (pct.mis >= 0 & pct.mis <= 100) {
      n.mis <- rep((n * (pct.mis/100)), J)
    }
    else {
      # stop() replaces warning(...); break, since break is only valid inside a loop
      stop("Percent missing values should be a number between 0 and 100! Exiting")
    }
  }
  else {
    # one percentage per column is expected
    if (length(pct.mis) != J)
      stop("The length of the missing-vector is not equal to the number of columns in the data! Exiting!")
    n.mis <- n * (pct.mis/100)
  }
  for (i in 1:ncol(data)) {
    if (n.mis[i] == 0) { # if column has no missing do nothing.
      data[, i] <- data[, i]
    }
    else {
      data[sample(1:n, n.mis[i], replace = FALSE), i] <- NA
      # For each given column (i), sample the row indices (1:n),
      # a number of indices to replace as "missing", n.mis[i], "NA",
      # without replacement
    }
  }
  return(as.data.frame(data))
}


Next, let’s synthetically generate (simulate) 1,000 cases including all 11 features in the data ({y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}).

n <- 1000; u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5), n)[sample(1:n, n)]);
x5 <- rep(letters[1:10], n)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10));
x7 <- rnorm(n); x8 <- factor(rep(seq(1, 10), n)[sample(1:n, n)]);
x9 <- runif(n, 0.1, 0.99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
# package the simulated data as a data frame object
sim_data <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
# randomly create missing values
sim_data_30pct_missing <- create.missing(sim_data, pct.mis=30);
head(sim_data_30pct_missing); summary(sim_data_30pct_missing)

##           y       x1       x2 x3   x4   x5 x6         x7   x8        x9
## 1        NA       NA 0.000000  0    1    h  8         NA    3        NA
## 2 11.449223       NA 5.236938  0    1    i NA         NA   10 0.2639489
## 3 -1.188296 0.000000 0.000000  0    5    a  3 -1.1469495 <NA> 0.4753195
## 4        NA       NA       NA  0 <NA>    e  6  1.4810186   10 0.6696932
## 5  4.267916 3.490833 0.000000  0 <NA> <NA> NA  0.9161912 <NA> 0.9578455
## 6        NA 0.000000 4.384732  1 <NA>    a NA         NA   10 0.6095176
##   x10
## 1   1
## 2   2
## 3  NA
## 4   3
## 5   8
## 6   6
##        y                x1              x2              x3
##  Min.   :-3.846   Min.   :0.000   Min.   :0.000   Min.   :0.0000
##  1st Qu.: 2.410   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000
##  Median : 5.646   Median :0.000   Median :3.068   Median :0.0000
##  Mean   : 5.560   Mean   :2.473   Mean   :2.545   Mean   :0.4443
##  3rd Qu.: 8.503   3rd Qu.:4.958   3rd Qu.:4.969   3rd Qu.:1.0000
##  Max.   :16.487   Max.   :8.390   Max.   :8.421   Max.   :1.0000
##  NA's   :300      NA's   :300     NA's   :300     NA's   :300
##     x4            x5            x6             x7                x8
##  1   :138   c      : 80   Min.   :1.00   Min.   :-2.5689   3      : 78
##  2   :129   h      : 76   1st Qu.:3.00   1st Qu.:-0.6099   7      : 77
##  3   :147   b      : 74   Median :5.00   Median : 0.0202   5      : 75
##  4   :144   a      : 73   Mean   :4.93   Mean   : 0.0435   4      : 73
##  5   :142   j      : 72   3rd Qu.:7.00   3rd Qu.: 0.7519   1      : 70
##  NA's:300   (Other):325   Max.   :9.00   Max.   : 3.7157   (Other):327
##             NA's   :300   NA's   :300    NA's   :300       NA's   :300
##        x9              x10
##  Min.   :0.1001   Min.   : 0.000
##  1st Qu.:0.3206   1st Qu.: 2.000
##  Median :0.5312   Median : 4.000
##  Mean   :0.5416   Mean   : 3.929
##  3rd Qu.:0.7772   3rd Qu.: 5.000
##  Max.   :0.9895   Max.   :11.000
##  NA's   :300      NA's   :300

# install.packages("mi")
# install.packages("betareg")
library("betareg"); library("mi")
# get show the missing information matrix
mdf <- missing_data.frame(sim_data_30pct_missing)
show(mdf)

## Object of class missing_data.frame with 1000 observations on 11 variables
##
## There are 542 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
##                      type missing method   model
## y              continuous     300    ppd  linear
## x1             continuous     300    ppd  linear
## x2             continuous     300    ppd  linear
## x3                 binary     300    ppd   logit
## x4    ordered-categorical     300    ppd  ologit
## x5  unordered-categorical     300    ppd  mlogit
## x6             continuous     300    ppd  linear
## x7             continuous     300    ppd  linear
## x8  unordered-categorical     300    ppd  mlogit
## x9             proportion     300    ppd betareg
## x10            continuous     300    ppd  linear
##
##          family     link transformation
## y      gaussian identity    standardize
## x1     gaussian identity    standardize
## x2     gaussian identity    standardize
## x3     binomial    logit           <NA>
## x4  multinomial    logit           <NA>
## x5  multinomial    logit           <NA>
## x6     gaussian identity    standardize
## x7     gaussian identity    standardize
## x8  multinomial    logit           <NA>
## x9     binomial    logit       identity
## x10    gaussian identity    standardize

# mdf@patterns # to get the textual missing pattern
image(mdf) # remember the visual pattern of this MCAR


The histogram plots display the distributions of:

• The observed data (in blue color),

• The imputed data (in red color), and

• The completed values (observed plus imputed, in gray color) (Figs. 3.16, 3.17 and 3.18).

# Next, try to impute the missing values.
# Get the Graph Parameters (plotting canvas/margins)
# set to plot the histograms for the 3 imputation chains
# mfcol=c(nr, nc). Subsequent histograms are drawn as nr-by-nc arrays on
# the graphics device by columns (mfcol), or rows (mfrow)
# oma: oma=c(bottom, left, top, right) giving the size of the outer
# margins in lines of text
# mar=c(bottom, left, top, right) gives the number of lines of margin
# to be specified on the four sides of the plot.
# tcl=length of tick marks as a fraction of the height of a line of
# text (default=-0.5)
par(mfcol=c(5, 5), oma=c(1, 1, 0, 0), mar=c(1, 1, 1, 0), tcl=-0.1,
mgp=c(0, 0, 0))
imputations <- mi(sim_data_30pct_missing, n.iter=5, n.chains=3, verbose=TRUE)
hist(imputations)
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Fig.  3.16  Imputation  chain  1:  Histogram  plots  comparing  the  initially  observed  (blue),  imputed 
(red),  and  imputed  complete  (gray)  data 
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Fig.  3.17  Imputation  chain  2:  Histogram  plots  comparing  the  initially  observed  (blue),  imputed 
(red),  and  imputed  complete  (gray)  data 
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Fig.  3.18  Imputation  chain  3:  Histogram  plots  comparing  the  initially  observed  (blue),  imputed 
(red),  and  imputed  complete  (gray)  data 


# Extract several multiply imputed data.frames from the "imputations" object
data.frames <- complete(imputations, 3)

# Compare the summaries for the original data (prior to introducing missing
# values) with missing data and the re-completed data following imputation
summary(sim_data); summary(sim_data_30pct_missing); summary(data.frames[[1]]);


##        y                x1              x2              x3           x4
##  Min.   :-3.846   Min.   :0.000   Min.   :0.000   Min.   :0.000   1:200
##  1st Qu.: 2.489   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   2:200
##  Median : 5.549   Median :0.000   Median :2.687   Median :0.000   3:200
##  Mean   : 5.562   Mean   :2.472   Mean   :2.516   Mean   :0.431   4:200
##  3rd Qu.: 8.325   3rd Qu.:4.996   3rd Qu.:5.007   3rd Qu.:1.000   5:200
##  Max.   :16.487   Max.   :8.390   Max.   :8.421   Max.   :1.000
##
##        y                x1              x2              x3
##  Min.   :-3.846   Min.   :0.000   Min.   :0.000   Min.   :0.0000
##  1st Qu.: 2.410   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000
##  Median : 5.646   Median :0.000   Median :3.068   Median :0.0000
##  Mean   : 5.560   Mean   :2.473   Mean   :2.545   Mean   :0.4443
##  3rd Qu.: 8.503   3rd Qu.:4.958   3rd Qu.:4.969   3rd Qu.:1.0000
##  Max.   :16.487   Max.   :8.390   Max.   :8.421   Max.   :1.0000
##  NA's   :300      NA's   :300     NA's   :300     NA's   :300

3 Managing Data in R

##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0

lapply(data.frames, summary)

## $`chain:1`
##        y                x1               x2           x3        x4
##  Min.   :-6.852   Min.   :-3.697   Min.   :-4.920   0:545   1:203
##  1st Qu.: 2.475   1st Qu.: 0.000   1st Qu.: 0.000   1:455   2:189
##  Median : 5.470   Median : 2.510   Median : 1.801           3:201
##  Mean   : 5.458   Mean   : 2.556   Mean   : 2.314           4:202
##  3rd Qu.: 8.355   3rd Qu.: 4.892   3rd Qu.: 4.777           5:205
##  Max.   :16.487   Max.   :10.543   Max.   : 8.864
##
##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0
##
## $`chain:2`
##        y                x1               x2           x3        x4
##  Min.   :-4.724   Min.   :-4.744   Min.   :-5.740   0:558   1:211
##  1st Qu.: 2.587   1st Qu.: 0.000   1st Qu.: 0.000   1:442   2:193
##  Median : 5.669   Median : 2.282   Median : 2.135           3:211
##  Mean   : 5.528   Mean   : 2.486   Mean   : 2.452           4:187
##  3rd Qu.: 8.367   3rd Qu.: 4.884   3rd Qu.: 4.782           5:198
##  Max.   :17.054   Max.   :10.445   Max.   :10.932
##
## $`chain:3`
##        y                x1               x2           x3        x4
##  Min.   :-5.132   Min.   :-8.769   Min.   :-3.643   0:538   1:200
##  1st Qu.: 2.414   1st Qu.: 0.000   1st Qu.: 0.000   1:462   2:182
##  Median : 5.632   Median : 2.034   Median : 2.610           3:215
##  Mean   : 5.537   Mean   : 2.417   Mean   : 2.530           4:211
##  3rd Qu.: 8.434   3rd Qu.: 4.836   3rd Qu.: 4.812           5:192
##  Max.   :16.945   Max.   :10.335   Max.   :11.683
##
##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0

Let’s  check  the  imputation  convergence  (details  provided  below)  (Figs.  3.19  and 
3.20). 


3.13  Missing  Data 


Fig. 3.19 Plots of the imputation iterations for the simulated dataset

Fig. 3.20 Comparison of the missingness patterns in the raw (top) and imputed (bottom) datasets

round(mipply(imputations, mean, to.matrix = TRUE), 3)


##             chain:1 chain:2 chain:3
## y            -0.013  -0.004  -0.003
## x1            0.016   0.003  -0.011
## x2           -0.045  -0.018  -0.003
## x3            1.455   1.442   1.462
## x4            3.017   2.968   3.013
## x5            5.321   5.406   5.480
## x6            0.023   0.004   0.005
## x7           -0.015  -0.005  -0.006
## x8            5.431   5.409   5.202
## x9            0.548   0.536   0.541
## x10          -0.015  -0.020  -0.009
## missing_y     0.300   0.300   0.300
## missing_x1    0.300   0.300   0.300
## missing_x2    0.300   0.300   0.300
## missing_x3    0.300   0.300   0.300
## missing_x4    0.300   0.300   0.300
## missing_x5    0.300   0.300   0.300
## missing_x6    0.300   0.300   0.300
## missing_x7    0.300   0.300   0.300
## missing_x8    0.300   0.300   0.300
## missing_x9    0.300   0.300   0.300
## missing_x10   0.300   0.300   0.300

Rhats(imputations, statistic = "moments")  # assess the convergence of MI algorithm

##    mean_y   mean_x1   mean_x2   mean_x3   mean_x4   mean_x5   mean_x6
## 1.0235026 1.1125720 1.1565542 0.9460979 1.0543446 1.3207898 0.9855947
##   mean_x7   mean_x8   mean_x9  mean_x10      sd_y     sd_x1     sd_x2
## 1.0023935 0.9438358 1.0192697 0.9927675 ?.9658852 1.6248062 1.0025950
##     sd_x3     sd_x4     sd_x5     sd_x6     sd_x7     sd_x8     sd_x9
## 0.9463044 1.0706666 1.4470270 1.2510790 ?.9008732 1.2865944 1.0195947
##    sd_x10
## 1.1760195
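To make the R-hat diagnostic concrete, here is a minimal base-R sketch of the classical potential scale reduction factor for a single scalar summary tracked across chains. It illustrates the general idea only; the exact computation inside mi's Rhats() may differ in its details, and the function name rhat and the simulated chains are assumptions.

```r
# Potential scale reduction factor for one scalar summary
# draws: n-by-m matrix (n iterations in rows, m chains in columns)
rhat <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))        # average within-chain variance
  B <- n * var(colMeans(draws))          # between-chain variance
  var_hat <- (n - 1) / n * W + B / n     # pooled estimate of the target variance
  sqrt(var_hat / W)
}

set.seed(1)
mixed <- matrix(rnorm(3000), ncol = 3)        # three chains sampling the same target
stuck <- sweep(mixed, 2, c(-2, 0, 2), "+")    # three chains centered at different values
rhat(mixed)   # close to 1
rhat(stuck)   # well above 1.1
```

Values near 1 suggest that the chains agree with each other; values noticeably above 1 (a common rule of thumb is 1.1) suggest running more iterations before trusting the imputations.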


plot(imputations); hist(imputations); image(imputations); summary(imputations)

## $y
## $y$is_missing
## missing
## FALSE  TRUE
##   700   300
##
## $y$imputed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.55100 -0.36930 -0.01107 -0.02191  0.30080  1.43600
##
## $y$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.17500 -0.39350  0.01069  0.00000  0.36770  1.36500
##
##
## $x1$is_missing
## missing
## FALSE  TRUE
##   700   300
##
## $x1$imputed
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -2.168000 -0.353600 -0.023620  0.008851  0.379800  1.556000
##
## $x1$observed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.4768 -0.4768 -0.4768  0.0000  0.4793  1.1410

## $x10$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.01800 -0.49980  0.01851  0.00000  0.27760  1.83200


Finally, pool over the m = 3 completed datasets when we fit the “model”. To estimate a linear regression model, we pool the fits obtained from the three chains. Figure 3.21 compares the observed and imputed distributions of the outcome for a simple bivariate linear model (y = x1 + x2).
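The pooling step can be illustrated with Rubin's rules, the standard way of combining estimates from m completed datasets. The base-R sketch below is an illustration under stated assumptions: the helper name rubin_pool and the toy completed datasets are hypothetical, ordinary lm() fits stand in for the model, and the pooling performed internally by mi's pool() may differ in its details.

```r
# Combine m point estimates and their standard errors with Rubin's rules
rubin_pool <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  u_bar <- mean(se^2)                 # within-imputation variance
  b     <- var(est)                   # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Toy example: three hypothetical completed datasets that differ slightly
set.seed(2)
x <- rnorm(200)
completed <- lapply(1:3, function(i) data.frame(x = x, y = 1 + 2 * x + rnorm(200)))
fits <- lapply(completed, function(d) coef(summary(lm(y ~ x, data = d))))
slope_est <- sapply(fits, function(f) f["x", "Estimate"])
slope_se  <- sapply(fits, function(f) f["x", "Std. Error"])
pooled <- rubin_pool(slope_est, slope_se)
pooled
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from not knowing the missing values.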


Fig. 3.21 Density plots comparing the observed and imputed outcome variable y

model_results <- pool(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=imputations, m=3)
display(model_results); summary(model_results)


## bayesglm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 +
##     x9 + x10, data = imputations, m = 3)
##             coef.est coef.se
## (Intercept)  0.77     0.84
## x1           0.94     0.05
## x2           0.97     0.04
## x31         -0.27     0.37
## x4.L         0.21     0.21
## x4.Q        -0.09     0.16
## x4.C         0.03     0.24
## x4^4         0.25     0.20
## x5b          0.03     0.42
## x5c         -0.41     0.26
## x5d         -0.22     0.86
## x5e          0.11     0.56
## x5f         -0.13     0.55
## x5g         -0.27     0.67
## x5h         -0.17     0.66
## x5i         -0.69     0.81
## x5j          0.21     0.28
## x6          -0.04     0.07
## x7           0.98     0.09
## x82          0.44     0.39
## x83          0.40     0.20
## x84         -0.14     0.62
## x85          0.20     0.30
## x86          0.19     0.25
## x87          0.19     0.38
## x88          0.51     0.34
## x89          0.25     0.26
## x810         0.17     0.48
## x9           0.88     0.71
## x10         -0.06     0.05
## n = 970, k = 30
## residual deviance = 2056.5, null deviance = 15851.5 (difference = 13795.0)
## overdispersion parameter = 2.1
## residual sd is sqrt(overdispersion) = 1.46


## Call:
## pool(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
##     x10, data = imputations, m = 3)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.8821  -0.6925  -0.0005   0.6859   3.7035
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.76906    0.83558   0.920 0.440149
## x1           0.94250    0.04535  20.781 0.000388 ***
## x2           0.97495    0.03517  27.721 2.01e-05 ***
## x31         -0.27349    0.37377  -0.732 0.533696
## x4.L         0.21116    0.21051   1.003 0.378488
## x4.Q        -0.08567    0.15627  -0.548 0.602349
## x4.C         0.02957    0.24490   0.121 0.911557
## x4^4         0.24987    0.19504   1.281 0.271639
## x5b          0.03327    0.41563   0.080 0.940649
## x5c         -0.41124    0.25525  -1.611 0.129304
## x5d         -0.21576    0.86290  -0.250 0.824194
## x5e          0.11334    0.56396   0.201 0.854842
## x5f         -0.13162    0.55187  -0.238 0.827734
## x5g         -0.27014    0.67022  -0.403 0.719913
## x5h         -0.16951    0.66294  -0.256 0.818576
## x5i         -0.68619    0.80975  -0.847 0.477639
## x5j          0.20681    0.27823   0.743 0.473891
## x6          -0.04009    0.07306  -0.549 0.633836
## x7           0.98130    0.08527  11.508 0.000197 ***
## x82          0.43774    0.38574   1.135 0.322775
## x83          0.40307    0.20475   1.969 0.049445 *
## x84         -0.13651    0.62307  -0.219 0.843284
## x85          0.19905    0.29973   0.664 0.528335
## x86          0.18662    0.24702   0.755 0.452036
## x87          0.18792    0.38029   0.494 0.647992
## x88          0.51106    0.34272   1.491 0.192478
## x89          0.25125    0.26340   0.954 0.356132
## x810         0.17383    0.47841   0.363 0.740434
## x9           0.87514    0.71484   1.224 0.334593
## x10         -0.05722    0.05035  -1.136 0.331688
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.120095)
##
##     Null deviance: 15851.5  on 999  degrees of freedom
## Residual deviance:  2056.5  on 970  degrees of freedom
## AIC: 3616.9
##
## Number of Fisher Scoring iterations: 7


# Report the summaries of the imputations
data.frames <- complete(imputations, 3)  # extract the first 3 chains
lapply(data.frames, summary)

## $`chain:1`
##        y                x1               x2           x3        x4
##  Min.   :-6.852   Min.   :-3.697   Min.   :-4.920   0:545   1:203
##  1st Qu.: 2.475   1st Qu.: 0.000   1st Qu.: 0.000   1:455   2:189
##  Median : 5.470   Median : 2.510   Median : 1.801           3:201
##  Mean   : 5.458   Mean   : 2.556   Mean   : 2.314           4:202
##  3rd Qu.: 8.355   3rd Qu.: 4.892   3rd Qu.: 4.777           5:205
##  Max.   :16.487   Max.   :10.543   Max.   : 8.864
##
##        x5            x6               x7                 x8
##  c      :118   Min.   :-4.291   Min.   :-2.73138   5      :117
##  b      :113   1st Qu.: 3.000   1st Qu.:-0.61765   7      :109
##  d      :111   Median : 5.000   Median : 0.00085   3      :105
##  h      :108   Mean   : 5.051   Mean   : 0.01486   1      :104
##  a      :105   3rd Qu.: 7.000   3rd Qu.: 0.71796   2      :102
##  ?      : 99   Max.   :13.284   Max.   : 3.71572   10     :100
##  (Other):346                                       (Other):363
##        x9              x10          missing_y       missing_x1
##  Min.   :0.0073   Min.   :-1.930   Mode :logical   Mode :logical
##  1st Qu.:0.3383   1st Qu.: 2.115   FALSE:700       FALSE:700
##  Median :0.5417   Median : 4.000   TRUE :300       TRUE :300
##  Mean   :0.5476   Mean   : 3.870   NA's :0         NA's :0
##  3rd Qu.:0.7635   3rd Qu.: 5.000
##  Max.   :0.9975   Max.   :11.000
##
##  missing_x2      missing_x3      missing_x4      missing_x5
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x6      missing_x7      missing_x8      missing_x9
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0
##
## $`chain:2`
##        y                x1               x2           x3        x4
##  Min.   :-4.724   Min.   :-4.744   Min.   :-5.740   0:558   1:211
##  1st Qu.: 2.587   1st Qu.: 0.000   1st Qu.: 0.000   1:442   2:193
##  Median : 5.669   Median : 2.282   Median : 2.135           3:211
##  Mean   : 5.528   Mean   : 2.486   Mean   : 2.452           4:187
##  3rd Qu.: 8.367   3rd Qu.: 4.884   3rd Qu.: 4.782           5:198
##  Max.   :17.054   Max.   :10.445   Max.   :10.932
##
##        x5            x6               x7                 x8
##  c      :114   Min.   :-1.498   Min.   :-2.65008   3      :123
##  h      :114   1st Qu.: 3.000   1st Qu.:-0.58182   1      :110
##  b      :109   Median : 5.000   Median : 0.02262   7      :108
##  g      :104   Mean   : 4.948   Mean   : 0.03298   5      :106
##  a      :103   3rd Qu.: 7.000   3rd Qu.: 0.71906   10     :101
##  d      :102   Max.   :12.954   Max.   : 3.71572   4      :100
##  (Other):354                                       (Other):352
##        x9              x10          missing_y       missing_x1
##  Min.   :0.0132   Min.   :-1.954   Mode :logical   Mode :logical
##  1st Qu.:0.3200   1st Qu.: 2.097   FALSE:700       FALSE:700
##  Median :0.5269   Median : 4.000   TRUE :300       TRUE :300
##  Mean   :0.5357   Mean   : 3.851   NA's :0         NA's :0
##  3rd Qu.:0.7612   3rd Qu.: 5.000
##  Max.   :0.9954   Max.   :11.000
##
##  missing_x2      missing_x3      missing_x4      missing_x5
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x6      missing_x7      missing_x8      missing_x9
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0
##
## $`chain:3`
##        y                x1               x2           x3        x4
##  Min.   :-5.132   Min.   :-8.769   Min.   :-3.643   0:538   1:200
##  1st Qu.: 2.414   1st Qu.: 0.000   1st Qu.: 0.000   1:462   2:182
##  Median : 5.632   Median : 2.034   Median : 2.610           3:215
##  Mean   : 5.537   Mean   : 2.417   Mean   : 2.530           4:211
##  3rd Qu.: 8.434   3rd Qu.: 4.836   3rd Qu.: 4.812           5:192
##  Max.   :16.945   Max.   :10.335   Max.   :11.683
##
##        x5            x6               x7                 x8
##  b      :123   Min.   :-2.223   Min.   :-2.76469   2      :139
##  ?      :115   1st Qu.: 3.000   1st Qu.:-0.64886   5      :111
##  c      :111   Median : 5.000   Median : 0.03266   1      :110
##  h      :103   Mean   : 4.957   Mean   : 0.03220   3      :109
##  i      :103   3rd Qu.: 7.000   3rd Qu.: 0.71341   7      :106
##  a      :100   Max.   :11.785   Max.   : 3.71572   9      :100
##  (Other):345                                       (Other):325
##        x9                x10          missing_y       missing_x1
##  Min.   :0.007236   Min.   :-1.522   Mode :logical   Mode :logical
##  1st Qu.:0.320579   1st Qu.: 2.224   FALSE:700       FALSE:700
##  Median :0.531962   Median : 4.000   TRUE :300       TRUE :300
##  Mean   :0.541147   Mean   : 3.894   NA's :0         NA's :0
##  3rd Qu.:0.772802   3rd Qu.: 5.000
##  Max.   :0.992118   Max.   :11.000
##
##  missing_x2      missing_x3      missing_x4      missing_x5
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x6      missing_x7      missing_x8      missing_x9
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:700       FALSE:700       FALSE:700       FALSE:700
##  TRUE :300       TRUE :300       TRUE :300       TRUE :300
##  NA's :0         NA's :0         NA's :0         NA's :0
##
##  missing_x10
##  Mode :logical
##  FALSE:700
##  TRUE :300
##  NA's :0

coef(summary(model_results))[, 1:2]  # get the model coef's and their SE's

##             Estimate Std. Error
## (Intercept)  0.76906    0.83558
## x1           0.94250    0.04535
## x2           0.97495    0.03517
## x31         -0.27349    0.37377
## x4.L         0.21116    0.21051
## x4.Q        -0.08567    0.15627
## x4.C         0.02957    0.24490
## x4^4         0.24987    0.19504
## x5b          0.03327    0.41563
## x5c         -0.41124    0.25525
## x5d         -0.21576    0.86290
## x5e          0.11334    0.56396
## x5f         -0.13162    0.55187
## x5g         -0.27014    0.67022
## x5h         -0.16951    0.66294
## x5i         -0.68619    0.80975
## x5j          0.20681    0.27823
## x6          -0.04009    0.07306
## x7           0.98130    0.08527
## x82          0.43774    0.38574
## x83          0.40307    0.20475
## x84         -0.13651    0.62307
## x85          0.19905    0.29973
## x86          0.18662    0.24702
## x87          0.18792    0.38029
## x88          0.51106    0.34272
## x89          0.25125    0.26340
## x810         0.17383    0.47841
## x9           0.87514    0.71484
## x10         -0.05722    0.05035


library("lattice")
densityplot(y ~ x1 + x2, data=imputations)

This plot, Fig. 3.21, allows us to compare the densities of the observed and the imputed data. These should be similar (though not identical) under MAR assumptions.


3.13.2  TBI Data Example

Next, we will see an example using the traumatic brain injury (TBI) dataset.

# Load the (raw) data from the table into a plain text file "08_EpiBioSData_Incomplete.csv"
TBI_Data <- read.csv("https://umich.instructure.com/files/720782/download?download_frd=1", na.strings=c("", ".", "NA"))
summary(TBI_Data)

##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs       er.gcs           icu.gcs         worst.gcs
##  Min.   : 3   Min.   : 3.000   Min.   : 0.000   Min.   : 0.0
##  1st Qu.: 3   1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.: 3.0
##  Median : 7   Median : 7.500   Median : 6.000   Median : 3.0
##  Mean   : 8   Mean   : 8.182   Mean   : 6.378   Mean   : 5.4
##  3rd Qu.:12   3rd Qu.:12.250   3rd Qu.: 8.000   3rd Qu.: 7.0
##  Max.   :15   Max.   :15.000   Max.   :14.000   Max.   :14.0
##  NA's   :2    NA's   :2        NA's   :1        NA's   :1
##     X6m.gose       X2013.gose       skull.fx        temp.injury
##  Min.   :2.000   Min.   :2.000   Min.   :0.0000   Min.   :0.000
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:0.0000   1st Qu.:0.000
##  Median :5.000   Median :7.000   Median :1.0000   Median :1.000
##  Mean   :4.805   Mean   :5.804   Mean   :0.6087   Mean   :0.587
##  3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:1.0000   3rd Qu.:1.000
##  Max.   :8.000   Max.   :8.000   Max.   :1.0000   Max.   :1.000
##  NA's   :5
##     surgery         spikes.hr           min.hr           max.hr
##  Min.   :0.0000   Min.   :  1.280   Min.   : 0.000   Min.   :  12.00
##  1st Qu.:0.0000   1st Qu.:  5.357   1st Qu.: 0.000   1st Qu.:  35.25
##  Median :1.0000   Median : 18.170   Median : 0.000   Median :  97.50
##  Mean   :0.6304   Mean   : 52.872   Mean   : 3.571   Mean   : 241.89
##  3rd Qu.:1.0000   3rd Qu.: 57.227   3rd Qu.: 0.000   3rd Qu.: 312.75
##  Max.   :1.0000   Max.   :294.000   Max.   :42.000   Max.   :1199.00
##                   NA's   :18        NA's   :18       NA's   :18
##     acute.sz         late.sz          ever.sz
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000
##  Median :0.0000   Median :1.0000   Median :1.000
##  Mean   :0.1739   Mean   :0.5652   Mean   :0.587
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000

1. Convert to a missing_data.frame (Fig. 3.22)

Fig. 3.22 Missing data pattern for the TBI case-study (dark cells represent missing data; variables are standardized and observations are clustered by missingness)

# Get information matrix of the data
# library("betareg"); library("mi")
mdf <- missing_data.frame(TBI_Data)  # warnings about missingness patterns

## NOTE: The following pairs of variables appear to have the same missingness pattern.
## Please verify whether they are in fact logically distinct variables.
##      [,1]      [,2]
## [1,] "icu.gcs" "worst.gcs"

show(mdf); mdf@patterns; image(mdf)

## Object of class missing_data.frame with 46 observations on 19 variables
##
## There are 7 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()

##                              type missing method  model
## id                     irrelevant       0   <NA>   <NA>
## age                    continuous       0   <NA>   <NA>
## sex                        binary       0   <NA>   <NA>
## mechanism   unordered-categorical       0   <NA>   <NA>
## field.gcs              continuous       2    ppd linear
## er.gcs                 continuous       2    ppd linear
## icu.gcs                continuous       1    ppd linear
## worst.gcs              continuous       1    ppd linear
## X6m.gose               continuous       5    ppd linear
## X2013.gose             continuous       0   <NA>   <NA>
## skull.fx                   binary       0   <NA>   <NA>
## temp.injury                binary       0   <NA>   <NA>
## surgery                    binary       0   <NA>   <NA>
## spikes.hr              continuous      18    ppd linear
## min.hr                 continuous      18    ppd linear
## max.hr                 continuous      18    ppd linear
## acute.sz                   binary       0   <NA>   <NA>
## late.sz                    binary       0   <NA>   <NA>
## ever.sz                    binary       0   <NA>   <NA>
##
##               family     link transformation
## id              <NA>     <NA>           <NA>
## age             <NA>     <NA>    standardize
## sex             <NA>     <NA>           <NA>
## mechanism       <NA>     <NA>           <NA>
## field.gcs   gaussian identity    standardize
## er.gcs      gaussian identity    standardize
## icu.gcs     gaussian identity    standardize
## worst.gcs   gaussian identity    standardize
## X6m.gose    gaussian identity    standardize
## X2013.gose      <NA>     <NA>    standardize
## skull.fx        <NA>     <NA>           <NA>
## temp.injury     <NA>     <NA>           <NA>
## surgery         <NA>     <NA>           <NA>
## spikes.hr   gaussian identity    standardize
## min.hr      gaussian identity    standardize
## max.hr      gaussian identity    standardize
## acute.sz        <NA>     <NA>           <NA>
## late.sz         <NA>     <NA>           <NA>
## ever.sz         <NA>     <NA>           <NA>


##  [1] spikes.hr, min.hr, max.hr
##  [2] field.gcs
##  [3] nothing
##  [4] nothing
##  [5] nothing
##  [6] nothing
##  [7] spikes.hr, min.hr, max.hr
##  [8] nothing
##  [9] nothing
## [10] nothing
## [11] nothing
## [12] nothing
## [13] spikes.hr, min.hr, max.hr
## [14] nothing
## [15] spikes.hr, min.hr, max.hr
## [16] spikes.hr, min.hr, max.hr
## [17] nothing
## [18] spikes.hr, min.hr, max.hr
## [19] spikes.hr, min.hr, max.hr
## [20] spikes.hr, min.hr, max.hr
## [21] X6m.gose, spikes.hr, min.hr, max.hr
## [22] nothing
## [23] spikes.hr, min.hr, max.hr
## [24] spikes.hr, min.hr, max.hr
## [25] spikes.hr, min.hr, max.hr
## [26] spikes.hr, min.hr, max.hr
## [27] spikes.hr, min.hr, max.hr
## [28] X6m.gose
## [29] spikes.hr, min.hr, max.hr
## [30] nothing
## [31] X6m.gose, spikes.hr, min.hr, max.hr
## [32] spikes.hr, min.hr, max.hr
## [33] nothing
## [34] nothing
## [35] nothing
## [36] nothing
## [37] field.gcs, er.gcs, icu.gcs, worst.gcs, X6m.gose
## [38] er.gcs
## [39] nothing
## [40] nothing
## [41] nothing
## [42] spikes.hr, min.hr, max.hr
## [43] nothing
## [44] nothing
## [45] nothing
## [46] X6m.gose
## 7 Levels: nothing field.gcs X6m.gose er.gcs ... field.gcs, er.gcs, icu.gcs, worst.gcs, X6m.gose
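The per-observation pattern labels shown above can be mimicked in base R, which may help clarify what the patterns slot reports. The following is a minimal sketch on a toy data frame; the helper name missing_pattern is hypothetical and this is not how the mi package computes its patterns internally.

```r
# Label each row by the set of variables that are missing in it
missing_pattern <- function(df) {
  miss <- is.na(df)                       # logical matrix, one column per variable
  apply(miss, 1, function(r) {
    if (any(r)) paste(colnames(miss)[r], collapse = ", ") else "nothing"
  })
}

# Toy data (an assumption made for illustration)
toy <- data.frame(a = c(1, NA, 3, NA), b = c(NA, NA, 3, 4), c = 1:4)
pat <- missing_pattern(toy)
pat          # "b" "a, b" "nothing" "a"
table(pat)   # frequency of each missingness pattern
```

Tabulating the labels, as the package message suggests with table(), gives a quick count of how many observations share each pattern.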


2. Configure the imputation process.

# mi::change() method changes the family, imputation method,
# size, type, and so forth of a missing variable. It's called
# before calling mi to affect how the conditional expectation of each
# missing variable is modeled.
mdf <- change(mdf, y = "spikes.hr", what = "transformation", to = "identity")

3. Examine the missingness patterns.

summary(mdf); hist(mdf);


##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs       er.gcs           icu.gcs         worst.gcs
##  Min.   : 3   Min.   : 3.000   Min.   : 0.000   Min.   : 0.0
##  1st Qu.: 3   1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.: 3.0
##  Median : 7   Median : 7.500   Median : 6.000   Median : 3.0
##  Mean   : 8   Mean   : 8.182   Mean   : 6.378   Mean   : 5.4
##  3rd Qu.:12   3rd Qu.:12.250   3rd Qu.: 8.000   3rd Qu.: 7.0
##  Max.   :15   Max.   :15.000   Max.   :14.000   Max.   :14.0
##  NA's   :2    NA's   :2        NA's   :1        NA's   :1
##     X6m.gose       X2013.gose       skull.fx        temp.injury
##  Min.   :2.000   Min.   :2.000   Min.   :0.0000   Min.   :0.000
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:0.0000   1st Qu.:0.000
##  Median :5.000   Median :7.000   Median :1.0000   Median :1.000
##  Mean   :4.805   Mean   :5.804   Mean   :0.6087   Mean   :0.587
##  3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:1.0000   3rd Qu.:1.000
##  Max.   :8.000   Max.   :8.000   Max.   :1.0000   Max.   :1.000
##  NA's   :5
##     surgery         spikes.hr           min.hr           max.hr
##  Min.   :0.0000   Min.   :  1.280   Min.   : 0.000   Min.   :  12.00
##  1st Qu.:0.0000   1st Qu.:  5.357   1st Qu.: 0.000   1st Qu.:  35.25
##  Median :1.0000   Median : 18.170   Median : 0.000   Median :  97.50
##  Mean   :0.6304   Mean   : 52.872   Mean   : 3.571   Mean   : 241.89
##  3rd Qu.:1.0000   3rd Qu.: 57.227   3rd Qu.: 0.000   3rd Qu.: 312.75
##  Max.   :1.0000   Max.   :294.000   Max.   :42.000   Max.   :1199.00
##                   NA's   :18        NA's   :18       NA's   :18
##     acute.sz         late.sz          ever.sz
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000
##  Median :0.0000   Median :1.0000   Median :1.000
##  Mean   :0.1739   Mean   :0.5652   Mean   :0.587
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000

image(mdf)


4. Perform the initial imputation (Fig. 3.23).

imputations <- mi(mdf, n.iter=10, n.chains=5, verbose=TRUE)
hist(imputations)

5. Extract several multiply imputed data.frames from the "imputations" object.

data.frames <- complete(imputations, 5)


Fig. 3.23 Validation plots for the original, imputed and complete TBI datasets


6. Report a list of "summaries" for each element (imputation instance).

lapply(data.frames, summary)

## $`chain:1`
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs          er.gcs           icu.gcs         worst.gcs
##  Min.   :-3.424   Min.   : 3.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 3.000   1st Qu.: 4.250   1st Qu.: 3.000   1st Qu.: 3.000
##  Median : 6.500   Median : 8.000   Median : 6.000   Median : 3.000
##  Mean   : 7.593   Mean   : 8.442   Mean   : 6.285   Mean   : 5.494
##  3rd Qu.:12.000   3rd Qu.:13.000   3rd Qu.: 7.750   3rd Qu.: 7.750
##  Max.   :15.000   Max.   :15.000   Max.   :14.000   Max.   :14.000
##
##     X6m.gose       X2013.gose    skull.fx temp.injury surgery
##  Min.   :2.000   Min.   :2.000   0:18     0:19        0:17
##  1st Qu.:3.000   1st Qu.:5.000   1:28     1:27        1:29
##  Median :5.000   Median :7.000
##  Mean   :5.031   Mean   :5.804
##  3rd Qu.:6.815   3rd Qu.:7.000
##  Max.   :8.169   Max.   :8.000
##
##    spikes.hr           min.hr            max.hr        acute.sz late.sz
##  Min.   :-86.914   Min.   :-11.697   Min.   :-153.94   0:38     0:20
##  1st Qu.:  3.953   1st Qu.:  0.000   1st Qu.:  42.25   1: 8     1:26
##  Median : 28.125   Median :  0.000   Median : 211.49
##  Mean   : 59.108   Mean   :  7.133   Mean   : 282.79
##  3rd Qu.:113.615   3rd Qu.: 11.329   3rd Qu.: 390.63
##  Max.   :294.000   Max.   : 43.706   Max.   :1199.00
##  ever.sz missing_field.gcs missing_er.gcs missing_icu.gcs
##  0:19    Mode :logical     Mode :logical  Mode :logical
##  1:27    FALSE:44          FALSE:44       FALSE:45
##          TRUE :2           TRUE :2        TRUE :1
##          NA's :0           NA's :0        NA's :0
##
##  missing_worst.gcs missing_X6m.gose missing_spikes.hr missing_min.hr
##  Mode :logical     Mode :logical    Mode :logical     Mode :logical
##  FALSE:45          FALSE:41         FALSE:28          FALSE:28
##  TRUE :1           TRUE :5          TRUE :18          TRUE :18
##  NA's :0           NA's :0          NA's :0           NA's :0
##
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## $`chain:2`

##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs          er.gcs          icu.gcs        worst.gcs
##  Min.   :-3.324   Min.   : 3.000   Min.   : 0.00   Min.   : 0.000
##  1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 3.00   1st Qu.: 3.000
##  Median : 6.500   Median : 7.000   Median : 6.00   Median : 3.000
##  Mean   : 7.658   Mean   : 8.046   Mean   : 6.24   Mean   : 5.466
##  3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.: 7.75   3rd Qu.: 7.750
##  Max.   :15.000   Max.   :15.000   Max.   :14.00   Max.   :14.000
##     X6m.gose        X2013.gose    skull.fx temp.injury surgery
##  Min.   :-3.196   Min.   :2.000   0:18     0:19        0:17
##  1st Qu.: 3.000   1st Qu.:5.000   1:28     1:27        1:29
##  Median : 5.000   Median :7.000
##  Mean   : 4.755   Mean   :5.804
##  3rd Qu.: 6.117   3rd Qu.:7.000
##  Max.   : 8.000   Max.   :8.000
##    spikes.hr            min.hr            max.hr        acute.sz late.sz
##  Min.   :-138.854   Min.   :-30.603   Min.   :-432.95   0:38     0:20
##  1st Qu.:   5.518   1st Qu.:  0.000   1st Qu.:  28.75   1: 8     1:26
##  Median :  34.522   Median :  0.000   Median :  97.50
##  Mean   :  61.310   Mean   :  2.329   Mean   : 209.53
##  3rd Qu.:  97.394   3rd Qu.:  4.306   3rd Qu.: 306.98
##  Max.   : 294.000   Max.   : 42.000   Max.   :1336.12
##  ever.sz missing_field.gcs missing_er.gcs missing_icu.gcs
##  0:19    Mode :logical     Mode :logical  Mode :logical
##  1:27    FALSE:44          FALSE:44       FALSE:45
##          TRUE :2           TRUE :2        TRUE :1
##          NA's :0           NA's :0        NA's :0
##  missing_worst.gcs missing_X6m.gose missing_spikes.hr missing_min.hr
##  Mode :logical     Mode :logical    Mode :logical     Mode :logical
##  FALSE:45          FALSE:41         FALSE:28          FALSE:28
##  TRUE :1           TRUE :5          TRUE :18          TRUE :18
##  NA's :0           NA's :0          NA's :0           NA's :0
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## $`chain:3`


##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs          er.gcs          icu.gcs         worst.gcs
##  Min.   : 3.000   Min.   : 3.000   Min.   : 0.000   Min.   : 0.
##  1st Qu.: 3.250   1st Qu.: 4.191   1st Qu.: 3.000   1st Qu.: 3.
##  Median : 7.000   Median : 7.500   Median : 6.000   Median : 3.
##  Mean   : 8.218   Mean   : 8.513   Mean   : 6.325   Mean   : 5.
##  3rd Qu.:12.000   3rd Qu.:12.750   3rd Qu.: 7.750   3rd Qu.: 7.
##  Max.   :17.978   Max.   :26.831   Max.   :14.000   Max.   :14.
##     X6m.gose       X2013.gose    skull.fx temp.injury surgery
##  Min.   :2.000   Min.   :2.000   0:18     0:19        0:17
##  1st Qu.:3.000   1st Qu.:5.000   1:28     1:27        1:29
##  Median :5.000   Median :7.000
##  Mean   :4.892   Mean   :5.804
##  3rd Qu.:6.000   3rd Qu.:7.000
##  Max.   :8.000   Max.   :8.000
##    spikes.hr           min.hr            max.hr       acute.sz late.sz
##  Min.   :-40.459   Min.   :-27.222   Min.   :-236.6   0:38     0:20
##  1st Qu.:  5.518   1st Qu.:  0.000   1st Qu.:  37.5   1: 8     1:26
##  Median : 34.864   Median :  0.000   Median : 195.6
##  Mean   : 65.781   Mean   :  2.619   Mean   : 281.8
##  3rd Qu.:100.137   3rd Qu.:  5.681   3rd Qu.: 476.1
##  Max.   :294.000   Max.   : 42.000   Max.   :1199.0
##  ever.sz missing_field.gcs missing_er.gcs missing_icu.gcs
##  0:19    Mode :logical     Mode :logical  Mode :logical
##  1:27    FALSE:44          FALSE:44       FALSE:45
##          TRUE :2           TRUE :2        TRUE :1
##          NA's :0           NA's :0        NA's :0
##  missing_worst.gcs missing_X6m.gose missing_spikes.hr missing_min.hr
##  Mode :logical     Mode :logical    Mode :logical     Mode :logical
##  FALSE:45          FALSE:41         FALSE:28          FALSE:28
##  TRUE :1           TRUE :5          TRUE :18          TRUE :18
##  NA's :0           NA's :0          NA's :0           NA's :0
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## $`chain:4`


##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs          er.gcs          icu.gcs         worst.gcs
##  Min.   : 3.000   Min.   :-7.960   Min.   :-1.610   Min.   :-4.339
##  1st Qu.: 3.250   1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.: 3.000
##  Median : 7.000   Median : 7.000   Median : 6.000   Median : 3.000
##  Mean   : 8.001   Mean   : 7.746   Mean   : 6.204   Mean   : 5.188
##  3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.: 7.750   3rd Qu.: 7.000
##  Max.   :15.000   Max.   :15.000   Max.   :14.000   Max.   :14.000
##     X6m.gose       X2013.gose    skull.fx temp.injury surgery
##  Min.   :2.000   Min.   :2.000   0:18     0:19        0:17
##  1st Qu.:3.000   1st Qu.:5.000   1:28     1:27        1:29
##  Median :5.000   Median :7.000
##  Mean   :5.095   Mean   :5.804
##  3rd Qu.:6.997   3rd Qu.:7.000
##  Max.   :8.930   Max.   :8.000
##    spikes.hr            min.hr             max.hr        acute.sz late.sz
##  Min.   :-106.439   Min.   :-28.5276   Min.   :-536.00   0:38     0:20
##  1st Qu.:   4.577   1st Qu.:  0.0000   1st Qu.:  32.21   1: 8     1:26
##  Median :  29.593   Median :  0.0000   Median :  98.15
##  Mean   :  52.290   Mean   : -0.1032   Mean   : 197.81
##  3rd Qu.:  84.794   3rd Qu.:  0.1667   3rd Qu.: 333.46
##  Max.   : 294.000   Max.   : 42.0000   Max.   :1199.00
##  ever.sz missing_field.gcs missing_er.gcs missing_icu.gcs
##  0:19    Mode :logical     Mode :logical  Mode :logical
##  1:27    FALSE:44          FALSE:44       FALSE:45
##          TRUE :2           TRUE :2        TRUE :1
##          NA's :0           NA's :0        NA's :0
##  missing_worst.gcs missing_X6m.gose missing_spikes.hr missing_min.hr
##  Mode :logical     Mode :logical    Mode :logical     Mode :logical
##  FALSE:45          FALSE:41         FALSE:28          FALSE:28
##  TRUE :1           TRUE :5          TRUE :18          TRUE :18
##  NA's :0           NA's :0          NA's :0           NA's :0
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##

## $`chain:5`
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##    field.gcs          er.gcs         icu.gcs         worst.gcs
##  Min.   : 3.000   Min.   :-15.73   Min.   : 0.000   Min.   :-2.742
##  1st Qu.: 3.250   1st Qu.:  4.00   1st Qu.: 3.000   1st Qu.: 3.000
##  Median : 7.000   Median :  7.32   Median : 6.000   Median : 3.000
##  Mean   : 8.473   Mean   :  7.65   Mean   : 6.439   Mean   : 5.223
##  3rd Qu.:12.750   3rd Qu.: 12.00   3rd Qu.: 8.000   3rd Qu.: 7.000
##  Max.   :20.172   Max.   : 15.00   Max.   :14.000   Max.   :14.000
##     X6m.gose        X2013.gose    skull.fx temp.injury surgery
##  Min.   : 2.000   Min.   :2.000   0:18     0:19        0:17
##  1st Qu.: 3.000   1st Qu.:5.000   1:28     1:27        1:29
##  Median : 5.000   Median :7.000
##  Mean   : 4.972   Mean   :5.804
##  3rd Qu.: 6.000   3rd Qu.:7.000
##  Max.   :11.481   Max.   :8.000
##    spikes.hr           min.hr            max.hr       acute.sz late.sz
##  Min.   :-74.552   Min.   :-11.877   Min.   :-570.4   0:38     0:20
##  1st Qu.:  5.518   1st Qu.: -1.924   1st Qu.:  37.5   1: 8     1:26
##  Median : 32.297   Median :  0.000   Median : 175.3
##  Mean   : 54.268   Mean   :  1.022   Mean   : 253.7
##  3rd Qu.: 71.288   3rd Qu.:  0.000   3rd Qu.: 432.1
##  Max.   :294.000   Max.   : 42.000   Max.   :1199.0
##  ever.sz missing_field.gcs missing_er.gcs missing_icu.gcs
##  0:19    Mode :logical     Mode :logical  Mode :logical
##  1:27    FALSE:44          FALSE:44       FALSE:45
##          TRUE :2           TRUE :2        TRUE :1
##          NA's :0           NA's :0        NA's :0
##  missing_worst.gcs missing_X6m.gose missing_spikes.hr missing_min.hr
##  Mode :logical     Mode :logical    Mode :logical     Mode :logical
##  FALSE:45          FALSE:41         FALSE:28          FALSE:28
##  TRUE :1           TRUE :5          TRUE :18          TRUE :18
##  NA's :0           NA's :0          NA's :0           NA's :0
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0

7.  Cast the imputed numbers as integers.

# (not necessary, but may be useful)

# get the indices of numeric columns
indx <- sapply(data.frames[[5]], is.numeric)

# cast each value as integer
data.frames[[5]][indx] <- lapply(data.frames[[5]][indx], function(x)
  as.numeric(as.integer(x)))

data.frames[[5]]$spikes.hr
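Because imputed values can be negative or fractional, it may help to see how the integer cast in step 7 treats them. The short sketch below uses hypothetical values (not taken from the TBI data) to contrast as.integer(), which truncates toward zero, with round(), which rounds to the nearest integer.

```r
# Hypothetical imputed values, for illustration only
x <- c(-3.7, -0.4, 2.5, 7.9)

truncated <- as.integer(x)  # drops the fractional part (truncation toward zero)
rounded   <- round(x)       # rounds to nearest; exact .5 ties go to the even integer

data.frame(x, truncated, rounded)
```

If rounding to the nearest integer is preferred over truncation, round() could be substituted for as.integer() in the cast above.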

8.  Save results out.

write.csv(data.frames[[5]], "C:\\Users\\User\\Desktop\\TBI_MIData.csv")


9.  Complete Data analytics functions.

# library("mi")
# lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()


10.  Fit a linear model for one multiply imputed chain.


# Also see Step (9)

## Regression on one imputed data set (chain 1); the same model can be fit
## to each of the 5 imputed data sets
fit_lm1 <- glm(ever.sz ~ surgery + worst.gcs + factor(sex) + age,
  data = data.frames$`chain:1`, family = "binomial")
summary(fit_lm1); display(fit_lm1)


##
## Call:
## glm(formula = ever.sz ~ surgery + worst.gcs + factor(sex) + age,
##     family = "binomial", data = data.frames$`chain:1`)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.7000  -1.2166   0.8222   1.0007   1.3871
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.249780   1.356397   0.184    0.854
## surgery1         0.947392   0.685196   1.383    0.167
## worst.gcs       -0.068734   0.097962  -0.702    0.483
## factor(sex)Male -0.329313   0.842761  -0.391    0.696
## age              0.004453   0.019431   0.229    0.819
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 62.371  on 45  degrees of freedom
## Residual deviance: 60.046  on 41  degrees of freedom
## AIC: 70.046
##
## Number of Fisher Scoring iterations: 4

## glm(formula = ever.sz ~ surgery + worst.gcs + factor(sex) + age,
##     family = "binomial", data = data.frames$`chain:1`)
##                 coef.est coef.se
## (Intercept)      0.25     1.36
## surgery1         0.95     0.69
## worst.gcs       -0.07     0.10
## factor(sex)Male -0.33     0.84
## age              0.00     0.02
## ---
##   n = 46, k = 5
##   residual deviance = 60.0, null deviance = 62.4 (difference = 2.3)


11.  Fit the appropriate model and pool the results.


# (estimates over MI chains)

model_results <- pool(ever.sz ~ surgery + worst.gcs + factor(sex) + age,
  family = "binomial", data = imputations, m = 5)
display(model_results); summary(model_results)


## bayesglm(formula = ever.sz ~ surgery + worst.gcs + factor(sex) +
##     age, data = imputations, m = 5, family = "binomial")
##                 coef.est coef.se
## (Intercept)      0.46     1.34
## surgery1         0.94     0.66
## worst.gcs       -0.09     0.10
## factor(sex)Male -0.33     0.77
## age              0.00     0.02
## n = 41, k = 5
## residual deviance = 59.3, null deviance = 62.4 (difference = 3.0)
##
## Call:
## pool(formula = ever.sz ~ surgery + worst.gcs + factor(sex) +
##     age, data = imputations, m = 5, family = "binomial")
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6796  -1.1802   0.8405   1.0225   1.3824
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.459917   1.344700   0.342    0.732
## surgery1         0.938109   0.661646   1.418    0.156
## worst.gcs       -0.089340   0.098293  -0.909    0.363
## factor(sex)Male -0.332875   0.770327  -0.432    0.666
## age              0.001582   0.019685   0.080    0.936
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 62.371  on 45  degrees of freedom
## Residual deviance: 59.343  on 41  degrees of freedom
## AIC: 69.343
##
## Number of Fisher Scoring iterations: 6.6

12.  Report the summaries of the imputations.


data.frames <- complete(imputations, 3)  # extract the first 3 chains
lapply(data.frames, summary)

## $`chain:1`
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## $`chain:2`
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## $`chain:3`
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0


13.  Validation:

Next, we can verify whether enough iterations were conducted. One validation
criterion requires that the mean of each completed variable be similar to the
corresponding mean of the complete data (Fig. 3.24).


Fig. 3.24  TBI data imputation quality plots


# The means should be similar for each of the k chains (in this case k=5).
# mipply is a wrapper for sapply invoked on mi-class objects to
# compute the column means
round(mipply(imputations, mean, to.matrix = TRUE), 3)

##                 chain:1 chain:2 chain:3 chain:4 chain:5
## id               23.500  23.500  23.500  23.500  23.500
## age               0.000   0.000   0.000   0.000   0.000
## sex               1.804   1.804   1.804   1.804   1.804
## ...
## missing_max.hr    0.391   0.391   0.391   0.391   0.391

# The Rhat convergence statistic compares the variance between chains to the
# variance within chains.
# Rhat values ~ 1.0 indicate likely convergence,
# Rhat values > 1.1 indicate that the chains should be run longer
# (use a larger number of iterations)
Rhats(imputations, statistic = "moments")  # assess the convergence of the MI algorithm

## mean_field.gcs    mean_er.gcs   mean_icu.gcs mean_worst.gcs  mean_X6m.gose
##       1.858663       2.200902       1.484120       2.360286       1.752000
## mean_spikes.hr    mean_min.hr    mean_max.hr   sd_field.gcs      sd_er.gcs
##       1.972090       1.393025       1.743846       1.291126       1.479967
##     sd_icu.gcs   sd_worst.gcs    sd_X6m.gose   sd_spikes.hr      sd_min.hr
##       1.417884       1.861408       1.355365       1.514089       1.723325
##      sd_max.hr
##       1.497625

# When convergence is unstable, we can continue the iterations for
# all chains, e.g.
imputations <- mi(imputations, n.iter = 20)  # add additional 20 iterations

# To plot the produced mi results, for all missing_variables we can generate
# a histogram of the observed, imputed, and completed data.
# We can compare the completed data to the fitted values
# implied by the model for the completed data, by plotting binned residuals.
# The hist function works similarly to plot.
# The image function gives a sense of the missingness patterns in the data
plot(imputations); hist(imputations)
image(imputations); summary(imputations)


## $id
## $id$is_missing
## [1] "all values observed"
##
## $id$observed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   12.25   23.50   23.50   34.75   46.00
##
##
## $age
## $age$is_missing
## [1] "all values observed"
##
## $age$observed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.6045 -0.4019 -0.1126  0.0000  0.2997  1.3340
##
##
## $sex
## $sex$is_missing
## [1] "all values observed"
##
## $sex$observed
##
##  1  2
##  9 37
##
## $late.sz$observed
##
##  1  2
## 20 26
##
## $ever.sz
## $ever.sz$is_missing
## [1] "all values observed"
##
## $ever.sz$observed
##
##  1  2
## 19 27

Fig. 3.25  Comparison of the missing data patterns in the original (top) and the completed (bottom)
TBI sets
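To make the Rhat criterion used in the validation step more concrete, the following minimal sketch computes a basic potential scale reduction factor from first principles on simulated chains. The helper rhat() and the simulated draws are illustrative assumptions; the Rhats() function in the mi package may differ in its details (e.g., which statistics it tracks and how it splits chains).

```r
# Basic potential scale reduction factor for a matrix of draws
# (rows = iterations, columns = chains); rhat() is a hypothetical helper
rhat <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))   # average within-chain variance
  B <- n * var(colMeans(draws))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(42)
# Five well-mixed chains drawn from the same distribution
mixed <- matrix(rnorm(30 * 5), nrow = 30, ncol = 5)
# Five chains stuck at different levels (poor mixing)
stuck <- mixed + matrix(rep(c(-2, -1, 0, 1, 2), each = 30), nrow = 30)

c(mixed = rhat(mixed), stuck = rhat(stuck))
```

In this sketch, the well-mixed chains should yield a value near 1, whereas the shifted chains should yield a value well above the 1.1 threshold, which is the situation where running additional iterations is advisable.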

14.  Pool over the m = 5 imputed chains when fitting the "linear
model". We can pool across the five chains in order to
estimate the final linear model (Fig. 3.25).

# regression model and impact of various predictors
model_results <- pool(ever.sz ~ surgery + worst.gcs + factor(sex) + age,
  data = imputations, m = 5); display(model_results); summary(model_results)

## bayesglm(formula = ever.sz ~ surgery + worst.gcs + factor(sex) +
##     age, data = imputations, m = 5)
##                 coef.est coef.se
## (Intercept)      0.58     1.35
## surgery1         0.99     0.66
## worst.gcs       -0.11     0.10
## factor(sex)Male -0.36     0.77
## age              0.00     0.02
## n = 41, k = 5
## residual deviance = 59.0, null deviance = 62.4 (difference = 3.4)
##
## Call:
## pool(formula = ever.sz ~ surgery + worst.gcs + factor(sex) +
##     age, data = imputations, m = 5)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6927  -1.1539   0.8245   1.0052   1.4009
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.578917   1.348831   0.429    0.668
## surgery1         0.990656   0.662991   1.494    0.135
## worst.gcs       -0.105240   0.095335  -1.104    0.270
## factor(sex)Male -0.357285   0.772307  -0.463    0.644
## age              0.000198   0.019702   0.010    0.992
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 62.371  on 45  degrees of freedom
## Residual deviance: 58.995  on 41  degrees of freedom
## AIC: 68.995
##
## Number of Fisher Scoring iterations: 7


coef(summary(model_results))[, 1:2]  # get the model coef's and their SE's

##                      Estimate Std. Error
## (Intercept)      0.5789166170 1.34883106
## surgery1         0.9906554934 0.66299111
## worst.gcs       -0.1052399155 0.09533513
## factor(sex)Male -0.3572845034 0.77230674
## age              0.0001980042 0.01970153


# Report the summaries of the imputations
data.frames <- complete(imputations, 3)  # extract the first 3 chains
lapply(data.frames, summary)             # report summaries

## [[1]]
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## [[2]]
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
##
## [[3]]
##        id             age            sex           mechanism
##  Min.   : 1.00   Min.   :16.00   Female: 9   Bike_vs_Auto: 4
##  1st Qu.:12.25   1st Qu.:23.00   Male  :37   Blunt       : 4
##  Median :23.50   Median :33.00               Fall        :13
##  Mean   :23.50   Mean   :36.89               GSW         : 2
##  3rd Qu.:34.75   3rd Qu.:47.25               MCA         : 7
##  Max.   :46.00   Max.   :83.00               MVA         :10
##                                              Peds_vs_Auto: 6
##  missing_max.hr
##  Mode :logical
##  FALSE:28
##  TRUE :18
##  NA's :0
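Pooling over the m imputed chains is commonly done with Rubin's rules, which combine the per-chain estimates and their standard errors. The sketch below illustrates these rules for a single coefficient using hypothetical per-chain values (not the TBI results); it is an assumption that this mirrors what a pooling routine does internally, and it is not the mi package's implementation.

```r
# Hypothetical per-chain estimates and standard errors for one coefficient (m = 5)
est <- c(0.95, 0.91, 1.02, 0.99, 0.94)
se  <- c(0.69, 0.66, 0.67, 0.65, 0.66)
m   <- length(est)

pooled_est <- mean(est)            # pooled point estimate
W <- mean(se^2)                    # within-imputation variance
B <- var(est)                      # between-imputation variance
total_var <- W + (1 + 1 / m) * B   # Rubin's total variance
pooled_se <- sqrt(total_var)

round(c(estimate = pooled_est, se = pooled_se), 4)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, since the between-imputation term accounts for the extra uncertainty due to the missing values.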


3.13.3  Imputation  via  Expectation-Maximization 


Below  we  present  the  theory  and  practice  of  one  specific  statistical  computing 
strategy  for  imputing  incomplete  datasets. 


Types  of  Missing  Data 

Recall  that  we  have  the  following  three  distinct  types  of  incomplete  data. 

•  MCAR: Data which is Missing Completely At Random has nothing systematic
about which observations are missing. There is no relationship between
missingness and either observed or unobserved covariates.

•  MAR: Missing At Random is weaker than MCAR. The missingness is still
random, but solely due to the observed variables. For example, those from a
lower socioeconomic status (SES) may be less willing to provide salary
information (but we know their SES). The key is that the missingness is not due
to the values which are not observed. MCAR implies MAR, but not vice-versa.

•  MNAR: If the data are Missing Not At Random, then the missingness depends on
the values of the missing data. Examples include censored data, self-reported
weight data where heavier individuals are less likely to report their weight, and
a response-measuring device that can only record values above 0.5, so that
anything below that threshold is missing.
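The three mechanisms above can be illustrated with a small simulation. The sketch below is a minimal example under assumed, hypothetical settings (a standardized SES score, a salary variable that depends on it, and logistic missingness probabilities); none of these values come from the text.

```r
set.seed(1234)
n <- 1000
ses    <- rnorm(n)                          # fully observed covariate (hypothetical SES score)
salary <- 50 + 10 * ses + rnorm(n, sd = 5)  # variable that will receive missing values

# MCAR: every salary has the same 20% chance of being missing
miss_mcar <- runif(n) < 0.2

# MAR: missingness depends only on the observed covariate (lower SES -> more missing)
miss_mar <- runif(n) < plogis(-1.5 - 1.5 * ses)

# MNAR: missingness depends on the (unobserved) salary value itself
miss_mnar <- runif(n) < plogis(-1.5 - 0.15 * (salary - 50))

# Compare the mean of the observed salaries to the full-data mean
obs_means <- c(full = mean(salary),
               MCAR = mean(salary[!miss_mcar]),
               MAR  = mean(salary[!miss_mar]),
               MNAR = mean(salary[!miss_mnar]))
round(obs_means, 2)
```

Under MCAR, the observed mean should stay close to the full-data mean, whereas under MAR and MNAR the observed mean is shifted. In the MAR case this shift can, in principle, be corrected using the observed SES, which is what model-based imputation exploits.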


General  Idea  of  EM  Algorithm 

Expectation-Maximization (EM) is an iterative parameter estimation process involving
two steps, expectation and maximization, which are applied in tandem. EM can be
employed to find parameter estimates using maximum likelihood and is specifically
useful when the equations determining the relations of the data-parameters cannot be
directly solved. For example, Gaussian mixture modeling assumes that each data
point ($X$) has a corresponding latent (unobserved) variable or a missing value ($Y$),
which may be specified as a mixture of coefficients determining the affinity of the
data as a linear combination of Gaussian kernels, determined by a set of parameters
($\theta$), e.g., means and variance-covariances. Thus, EM estimation relies on:

•  An observed data set $X$,

•  A set of missing (or latent) values $Y$,

•  A parameter $\theta$, which may be a vector of parameters,

•  A likelihood function $L(\theta | X, Y) = p(X, Y | \theta)$, and

•  The maximum likelihood estimate (MLE) of the unknown parameter(s) $\theta$ that is
computed using the marginal likelihood of the observed data:

$$L(\theta | X) = p(X | \theta) = \int p(X, Y | \theta) dY.$$

Most of the time, this equation may not be directly solved, e.g., when $Y$ is missing.

•  Expectation step (E step): computes the expected value of the log likelihood
function, with respect to the conditional distribution of $Y$ given $X$ using the
parameter estimates at the previous iteration (or at the position of initialization,
for the first iteration), $\theta^{(t)}$:

$$Q(\theta | \theta^{(t)}) = E_{Y | X, \theta^{(t)}}[\log L(\theta | X, Y)];$$

•  Maximization step (M step): Determine the parameters, $\theta$, that maximize the
expectation above,

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)}).$$
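The alternating E and M steps described above can be sketched for the Gaussian mixture example. The following minimal illustration fits a two-component univariate mixture to simulated data; the data, starting values, and stopping rule are assumptions made for this sketch and are not part of the TBI analysis.

```r
set.seed(2023)
# Simulated data: two Gaussian clusters; the cluster labels play the role of the latent Y
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Initial guess for theta = (mixing weight, means, standard deviations)
w <- 0.5; mu <- c(-1, 1); s <- c(1, 1)
loglik_path <- numeric(0)

for (iter in 1:200) {
  # E step: posterior probability that each point belongs to component 2
  d1 <- (1 - w) * dnorm(x, mu[1], s[1])
  d2 <- w * dnorm(x, mu[2], s[2])
  r  <- d2 / (d1 + d2)
  loglik_path <- c(loglik_path, sum(log(d1 + d2)))

  # M step: weighted updates that maximize Q(theta | theta^(t))
  w  <- mean(r)
  mu <- c(sum((1 - r) * x) / sum(1 - r), sum(r * x) / sum(r))
  s  <- c(sqrt(sum((1 - r) * (x - mu[1])^2) / sum(1 - r)),
          sqrt(sum(r * (x - mu[2])^2) / sum(r)))

  # stop when the observed-data log-likelihood stops improving
  if (iter > 1 && abs(diff(tail(loglik_path, 2))) < 1e-8) break
}
round(c(weight = w, mu1 = mu[1], mu2 = mu[2], sd1 = s[1], sd2 = s[2]), 3)
```

A useful diagnostic is that the recorded log-likelihood values in loglik_path should never decrease from one iteration to the next, which is a general property of EM.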


EM-Based  Imputation 


The  EM  algorithm  is  an  alternative  to  Newton-Raphson,  or  the  method  of  scoring, 
for  computing  MLE  in  cases  where  the  complications  in  calculating  the  MLE  are 
due  to  incomplete  observation  and  data  are  MAR,  missing  at  random,  with  separate 
parameters  for  observation  and  the  missing  data  mechanism,  so  the  missing  data 
mechanism  can  be  ignored. 


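Complete Data:

$$Z = \begin{pmatrix} X \\ Y \end{pmatrix}, \quad
ZZ^T = \begin{pmatrix} XX^T & XY^T \\ YX^T & YY^T \end{pmatrix},$$

where $X$ is the observed data and $Y$ is the missing data.

•  E-step: (Expectation) Get the expectations of $Y$ and $YY^T$ based on observed data.

•  M-step: (Maximization) Maximize the conditional expectation in E-step to
estimate the parameters.

Details: If o = obs and m = mis stand for observed and missing, the mean vector,
$(\mu_{obs}, \mu_{mis})^T$, and the variance-covariance matrix, $\Sigma$, at
iteration $t$ are represented by:

$$\mu^{(t)} = \begin{pmatrix} \mu_{obs} \\ \mu_{mis} \end{pmatrix}, \quad
\Sigma^{(t)} = \begin{pmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{pmatrix}.$$

E-step:

$$E(Z|X) = \begin{pmatrix} X \\ E(Y|X) \end{pmatrix}, \quad
E(ZZ^T|X) = \begin{pmatrix} XX^T & X E(Y|X)^T \\ E(Y|X) X^T & E(YY^T|X) \end{pmatrix}.$$

$$E(Y|X) = \mu_{mis} + \Sigma_{mo}\Sigma_{oo}^{-1}(X - \mu_{obs}).$$

$$E(YY^T|X) = (\Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}) + E(Y|X)E(Y|X)^T.$$

M-step:

$$\mu^{(t+1)} = \frac{1}{n}\sum_{i=1}^n E(Z|X) \quad \text{and} \quad
\Sigma^{(t+1)} = \frac{1}{n}\sum_{i=1}^n E(ZZ^T|X) - \mu^{(t+1)}{\mu^{(t+1)}}^T.$$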
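As a small worked instance of the E-step formulas above, the sketch below computes E(Y|X) and E(YY^T|X) for a single case with one missing feature. The mean vector, covariance matrix, and observed values are hypothetical numbers chosen only for illustration.

```r
# Hypothetical current estimates mu^(t) and Sigma^(t) for a 3-feature Gaussian
mu    <- c(2, 2, 2)
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), nrow = 3, byrow = TRUE)

# One case with the third feature missing
z    <- c(4.0, 1.5, NA)
miss <- is.na(z)

# E(Y|X) = mu_mis + Sigma_mo Sigma_oo^{-1} (X - mu_obs)
S_oo_inv  <- solve(Sigma[!miss, !miss])
cond_mean <- mu[miss] +
  Sigma[miss, !miss, drop = FALSE] %*% S_oo_inv %*% (z[!miss] - mu[!miss])

# Conditional covariance: Sigma_mm - Sigma_mo Sigma_oo^{-1} Sigma_om
cond_cov <- Sigma[miss, miss, drop = FALSE] -
  Sigma[miss, !miss, drop = FALSE] %*% S_oo_inv %*% Sigma[!miss, miss, drop = FALSE]

# E(YY^T|X) = conditional covariance + E(Y|X) E(Y|X)^T
second_moment <- cond_cov + cond_mean %*% t(cond_mean)

c(cond_mean = cond_mean, cond_cov = cond_cov, second_moment = second_moment)
```

The conditional mean is the value that an EM-based imputation would place in the missing cell at this iteration, while the conditional covariance term is what distinguishes E(YY^T|X) from the simple outer product of the imputed values.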

A  Simple  Manual  Implementation  of  EM-Based  Imputation 

To demonstrate the implementation of an EM-based imputation method from first
principles, let's simulate 20 (feature) vectors of 200 (cases) normally distributed
random values.

library(MASS)   # provides mvrnorm()
library(knitr)  # provides kable()

set.seed(202227)
mu <- as.matrix(rep(2, 20))
sig <- diag(c(1:20))

# Add a noise item. The noise is
# $\epsilon \sim MVN(as.matrix(rep(0, 20)), diag(rep(1, 20)))$
sim_data <- mvrnorm(n = 200, mu, sig) +
  mvrnorm(n = 200, as.matrix(rep(0, 20)), diag(rep(1, 20)))

# save these in the "original" object
sim_data.orig <- sim_data

# introduce 500 random missing indices (in the total of 4000=200*20)
# discrete distribution where the probability of the elements of values is
# proportional to probs, which are normalized to add up to 1.
rand.miss <- e1071::rdiscrete(500, probs = rep(1, length(sim_data)),
  values = seq(1, length(sim_data)))
sim_data[rand.miss] <- NA

sum(is.na(sim_data))  # check how many missing (NA) are there < 500
## [1] 459

# cast the data into a data.frame object and report 15*10 elements
sim_data.df <- data.frame(sim_data)
kable(sim_data.df[1:15, 1:10],
  caption = "The first 15 rows and first 10 columns of the simulation data")
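The count of missing entries above (459) is smaller than the 500 requested indices because the discrete draw samples with replacement, so some cell indices repeat. The base-R sketch below illustrates this effect with sample(); the seed and variable names are assumptions for this illustration, and the exact counts will differ from the output above.

```r
set.seed(2022)
n_cells <- 200 * 20   # 4000 matrix entries
n_draws <- 500

# Sampling WITH replacement: repeated indices reduce the number of distinct cells
idx_with <- sample(n_cells, n_draws, replace = TRUE)
n_distinct_with <- length(unique(idx_with))

# Expected number of distinct indices under uniform sampling with replacement
expected_distinct <- n_cells * (1 - (1 - 1 / n_cells)^n_draws)

# Sampling WITHOUT replacement guarantees exactly 500 distinct missing cells
idx_without <- sample(n_cells, n_draws, replace = FALSE)
n_distinct_without <- length(unique(idx_without))

c(with = n_distinct_with, expected = round(expected_distinct, 1),
  without = n_distinct_without)
```

If exactly 500 missing cells are required, sampling without replacement is the simpler choice; the expected count with replacement is about 470 for these settings.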

Table 3.1 shows the first 15 rows and first 10 columns of the simulated data;
note the missing values.

Now, let's define the EM imputation method function:


EM_algorithm <- function(x, tol = 0.001) {
  # identify the missing data entries (Boolean indices)
  missvals <- is.na(x)

  # initialize the EM-iteration
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0

  # compute \Sigma on complete data
  sigma <- as.matrix(var(na.exclude(x)))

  # compute the vector of feature (column) means
  mean.vec <- as.matrix(apply(na.exclude(x), 2, mean))

  while (reach.tol != 1) {
    for (i in 1:nrow(x)) {
      pick.miss <- (c(missvals[i, ]))
      if (sum(pick.miss) != 0) {
        # compute inverse-Sigma_completeData, variance-covariance matrix
        inv.S <- solve(sigma[!pick.miss, !pick.miss], tol = 1e-40)

        # Expectation Step
        # $$E(Y|X)=\mu_{mis}+\Sigma_{mo}\Sigma_{oo}^{-1}(X-\mu_{obs})$$
        new.impute[i, pick.miss] <- mean.vec[pick.miss] +
          sigma[pick.miss, !pick.miss] %*% inv.S %*%
          (t(new.impute[i, !pick.miss]) - t(t(mean.vec[!pick.miss])))
      }
    }

    # Maximization Step
    # Compute the complete \Sigma complete vector of feature (column) means
    # $$\Sigma^{(t+1)} = \frac{1}{n}\sum_{i=1}^nE(ZZ^T|X) -
    # \mu^{(t+1)}{\mu^{(t+1)}}^T$$
    sigma <- var((new.impute))

    # $$\mu^{(t+1)} = \frac{1}{n}\sum_{i=1}^nE(Z|X)$$
    mean.vec <- as.matrix(apply(new.impute, 2, mean))

    # Inspect for convergence tolerance, start with the 2nd iteration
    if (count.iter > 1) {
      # converged only if no entry changed by more than tol; the flag is
      # set once and then cleared by any entry that still moves
      reach.tol <- 1
      for (l in 1:nrow(new.impute)) {
        for (m in 1:ncol(new.impute)) {
          if (abs((old.impute[l, m] - new.impute[l, m])) > tol) {
            reach.tol <- 0
          }
        }
      }
    }

    count.iter <- count.iter + 1
    old.impute <- new.impute
  }

  # return the imputation output of the current iteration that passed the
  # tolerance level
  return(new.impute)
}

sim_data.imputed <- EM_algorithm(sim_data.df, tol=0.0001)
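
The Expectation step can also be checked in isolation on a tiny example. The following is a minimal sketch, assuming a multivariate normal model whose mean vector and covariance matrix are known; the helper name conditional_mean and the toy inputs are hypothetical and are not part of the listing above.

```r
# Conditional-mean (E-step) imputation for a single observation, assuming a
# multivariate normal model with known mean vector mu and covariance Sigma.
# The function name and all inputs are hypothetical, for illustration only.
conditional_mean <- function(x, mu, Sigma) {
  miss <- is.na(x)
  if (!any(miss)) return(x)     # nothing to impute
  if (all(miss))  return(mu)    # no observed entries: fall back to the mean
  S_mo <- Sigma[miss, !miss, drop = FALSE]    # cov(missing, observed)
  S_oo <- Sigma[!miss, !miss, drop = FALSE]   # cov(observed, observed)
  # mu_mis + S_mo %*% S_oo^{-1} %*% (x_obs - mu_obs)
  x[miss] <- mu[miss] + as.vector(S_mo %*% solve(S_oo, x[!miss] - mu[!miss]))
  x
}

# Bivariate example: unit variances, covariance 0.5, second entry missing
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
imputed <- conditional_mean(c(2, NA), mu, Sigma)
print(imputed)   # the missing entry becomes 0 + 0.5 * (2 - 0) = 1
```

With a diagonal covariance matrix the observed entries carry no information about the missing ones, so the same helper simply returns the corresponding feature mean.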


Plotting Complete and Imputed Data

Smaller black points represent observed data, and magenta circles denote the
imputed data (Fig. 3.26).


Fig.  3.26  Four  scatterplots  for  pairs  of  features  illustrating  the  complete  data  (small-black  points), 
the  imputed  data  points  (larger-pink  points),  and  2D  Gaussian  kernels 


plot.me <- function(index1, index2){
  # rows with a missing value in either of the two selected features
  plot.imputed <- sim_data.imputed[row.names(
    subset(sim_data.df, is.na(sim_data.df[, index1]) |
             is.na(sim_data.df[, index2]))), ]
  p = ggplot(sim_data.imputed, aes_string(paste0("X", index1),
                                          paste0("X", index2))) +
    geom_point(alpha = 0.5, size = 0.7) + theme_bw() +
    stat_ellipse(type = "norm", color = "#000099", alpha = 0.5) +
    geom_point(data = plot.imputed, aes_string(paste0("X", index1),
      paste0("X", (index2))), size = 1.5, color = "Magenta", alpha = 0.8)
}

gridExtra::grid.arrange(plot.me(1,2), plot.me(5,6), plot.me(13,20),
                        plot.me(18,19), nrow = 2)

Validation  of  EM-Imputation  Using  the  Amelia  R  Package 

See  this  Amelia  paper  (https://gking.harvard.edu/files/gking/files/amelia_jss.pdf) 
and  the  corresponding  R  manual. 

Comparison 


Let’s use the amelia function to impute the original data sim_data.df and compare
the results to the simpler manual EM_algorithm imputation defined above.

# install.packages("Amelia")
library(Amelia)

dim(sim_data.df)
## [1] 200  20

amelia.out <- amelia(sim_data.df, m = 5)
## -- Imputation 1 --
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## -- Imputation 2 --
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
## -- Imputation 3 --
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
## -- Imputation 4 --
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24
## -- Imputation 5 --
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

amelia.out
##
## Amelia output with 5 imputed datasets.
## Return code:  1
## Message:  Normal EM convergence.
##
## Chain Lengths:
## --------------------
## Imputation 1:  20
## Imputation 2:  15
## Imputation 3:  16
## Imputation 4:  24
## Imputation 5:  17

amelia.imputed.5 <- amelia.out$imputations[[5]]
•  Magenta-color and circle-shape denote manual imputation via
EM_algorithm

•  Orange-color and square-shapes denote Amelia imputation (Figs. 3.27 and
3.28).


Fig.  3.27  Scatter  plot  of  the  second  and  fourth  features.  Magenta-circles  and  Orange-squares 
represent  the  manual  imputation  via  EM_algorithm  and  the  automated  Amelia-based  imputation 


Fig.  3.28  Same  as  Fig.  3.27,  for  features  17  and  18 


plot.me2 <- function(index1, index2){
  # rows with a missing value in either of the two selected features
  plot.imputed <- sim_data.imputed[row.names(
    subset(sim_data.df, is.na(sim_data.df[, index1]) |
             is.na(sim_data.df[, index2]))), ]
  plot.imputed2 <- amelia.imputed.5[row.names(
    subset(sim_data.df, is.na(sim_data.df[, index1]) |
             is.na(sim_data.df[, index2]))), ]

  p = ggplot(sim_data.imputed, aes_string(paste0("X", index1),
                                          paste0("X", index2))) +
    geom_point(alpha = 0.8, size = 0.7) + theme_bw() +
    stat_ellipse(type = "norm", color = "#000099", alpha = 0.5) +
    geom_point(data = plot.imputed, aes_string(paste0("X", index1),
      paste0("X", (index2))), size = 2.5, color = "Magenta", alpha = 0.9, shape = 16) +
    # plot the Amelia imputations on the same pair of features
    # (the hard-coded X1, X2 aesthetics ignored index1 and index2)
    geom_point(data = plot.imputed2, aes_string(paste0("X", index1),
      paste0("X", (index2))), size = 2.5,
      color = "#FF9933", alpha = 0.8, shape = 18)
  return(p)
}

plot.me2(2, 4)

plot.me2(17, 18)


Density  Plots 

Finally, we can compare the densities of the original, manually-imputed and
Amelia-imputed datasets. Remember that in this simulation, we drew 500 random
indices, which left 459 distinct missing values out of the 4000 that we
synthetically generated (Fig. 3.29).


3.14  Parsing  Webpages  and  Visualizing  Tabular 
HTML  Data 


In  this  section,  we  will  utilize  the  Earthquakes  dataset  on  the  SOCR  website.  It  stores 
information  about  earthquakes  of  magnitudes  larger  than  5  on  the  Richter  scale  that 
were  recorded  between  1969  and  2007.  Here  is  how  we  download  and  parse  the  data 
on  the  source  webpage  and  ingest  the  information  into  R: 


# install.packages("xml2")
library("XML"); library("xml2")
library("rvest")

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021708_Earthquakes")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top

earthquake <- html_table(html_nodes(wiki_url, "table")[[2]])


Fig. 3.29 Density plots of the original, manually-imputed and Amelia-imputed datasets,
10  features  only 

In this dataset, Magt (magnitude type) may be used as a grouping variable. We will
draw a “Longitude vs. Latitude” scatter plot from this dataset. The function we are
using is called ggplot(), available from the R package ggplot2. The input type for
this function is a data frame, and aes() specifies the axes (Fig. 3.30).
Fig. 3.30 Earthquake data plot of magnitude type (color/shape) against longitude (x) and
latitude  (y) 


library(ggplot2)

plot4 <- ggplot(earthquake, aes(Longitude, Latitude, group=Magt, color=Magt)) +
  geom_point(data=earthquake, size=4, mapping=aes(x=Longitude, y=Latitude,
                                                   shape=Magt))

plot4  # or print(plot4)

We can see the plotting script consists of two parts. The first part,
ggplot(earthquake, aes(Longitude, Latitude, group = Magt, color = Magt)),
specifies the setting of the plot: dataset, group and color. The
second part specifies that we are going to draw points for all data points. In later
chapters, we will frequently use ggplot2, which typically chains multiple function
calls, e.g., function1 + function2.

We can visualize the distribution for different variables using density plots. The
following code chunk plots the distribution of Latitude among different Magnitude
types. It also uses the ggplot() function, combined with
geom_density() (Fig. 3.31).

plot5 <- ggplot(earthquake, aes(Latitude, size=1)) +
  geom_density(aes(color=Magt))

plot5

We can also compute and display 2D Kernel Density and 3D Surface Plots.
Plotting 2D Kernel Density and 3D Surface plots is very important and useful in
multivariate exploratory data analytics.

We will use the plot_ly() function from the plotly package, which takes data
frame inputs.

To create a surface plot, we use two vectors, x and y, of lengths m and
n, respectively. We also need a matrix z of size m x n. This z matrix is created
from matrix multiplication between x and y (an m x 1 column times a 1 x n row).
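
As a small illustration of the shapes involved, the sketch below builds such a z matrix from two short grid vectors. The grid values are hypothetical, and the plotting calls are left as comments because they need an interactive graphics session (and, for the second one, the plotly package).

```r
# Hypothetical grid vectors of lengths m = 5 and n = 4
x <- seq(-2, 2, length.out = 5)
y <- seq(0, 3, length.out = 4)

# (m x 1) column times (1 x n) row gives the m x n surface matrix
z <- x %*% t(y)
dim(z)

# In an interactive session, z can be rendered as a surface, for example:
# persp(x, y, z)                              # base graphics
# plotly::plot_ly(z = ~z, type = "surface")   # plotly
```

The product x %*% t(y) is the same matrix that outer(x, y) returns; any other function evaluated on the (x, y) grid, such as a kernel density estimate, can take its place.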
Fig. 3.31 Modified Earthquake density plot (y) of magnitude type against latitude coordinates (x)


The kde2d() function is needed for 2D kernel density estimation.


kernal_density <- with(earthquake, MASS::kde2d(Longitude, Latitude, n = 50))

Here, z is an estimate of the kernel density function. Then, we apply plot_ly to
the list kernal_density via the with() function.

library(plotly)

with(kernal_density, plot_ly(x=x, y=y, z=z, type="surface"))
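
To see what the list returned by kde2d() contains without downloading the Earthquakes table, here is a minimal sketch on simulated coordinates. The vectors lon and lat are hypothetical stand-ins for the Longitude and Latitude columns.

```r
library(MASS)   # kde2d() is part of MASS, a recommended package shipped with R

set.seed(1234)
# Hypothetical stand-ins for the Longitude and Latitude columns
lon <- rnorm(300, mean = -120, sd = 3)
lat <- rnorm(300, mean = 37, sd = 2)

kd <- kde2d(lon, lat, n = 50)
names(kd)    # grid coordinates "x" and "y", and the density matrix "z"
dim(kd$z)    # one density estimate per grid point

# grid location with the highest estimated density
peak <- which(kd$z == max(kd$z), arr.ind = TRUE)
c(lon = kd$x[peak[1, 1]], lat = kd$y[peak[1, 2]])
```

Rows of z are indexed by the x grid and columns by the y grid, which is why the peak is looked up as kd$x[row] and kd$y[column].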

Note that we used the option "surface"; however, you can experiment with the
type option.

Alternatively, one can plot 1D, 2D, or 3D plots (Fig. 3.32):

plot_ly(x = ~ earthquake$Longitude)
## No trace type specified:
##   Based on info supplied, a 'histogram' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#histogram

plot_ly(x = ~ earthquake$Longitude, y = ~earthquake$Latitude)

plot_ly(x=~earthquake$Longitude, y=~earthquake$Latitude, z=~earthquake$Mag)

df3D <- data.frame(x=earthquake$Longitude, y=earthquake$Latitude,
                   z=earthquake$Mag)

# Convert the Long (X, Y, Z) Earthquake format data into a Matrix Format
# install.packages("Matrix")
library("Matrix")

matrix_EarthQuakes <- with(df3D, sparseMatrix(i = as.numeric(180-x),
    j=as.numeric(y), x=z, use.last.ij=T, dimnames=list(levels(x), levels(y))))
dim(matrix_EarthQuakes)
## [1] 307  44

View(as.matrix(matrix_EarthQuakes))

# view matrix is 2D heatmap:
library("ggplot2"); library("gplots")

heatmap.2(as.matrix(matrix_EarthQuakes[280:307, 30:44]), Rowv=FALSE,
    Colv=FALSE, dendrogram='none', cellnote=as.matrix(matrix_EarthQuakes[
    280:307, 30:44]), notecol="black", trace='none', key=FALSE, lwid =
    c(.01, .99), lhei = c(.01, .99), margins = c(5, 15))

# Long -180<x<-170, Lat: 30<y<45, Z: 5<Mag<8
matrix_EarthQuakes <- with(df3D, sparseMatrix(i = as.numeric(180+x),
    j=as.numeric(y), x=z, use.last.ij=TRUE, dimnames=list(levels(x), levels(y))))
mat1 <- as.matrix(matrix_EarthQuakes)
plot_ly(z = ~mat1, type = "surface")

# To plot the Aggregate (Summed) Magnitudes at all Long/Lat:
matrix_EarthQuakes <- with(df3D, sparseMatrix(i = as.numeric(180+x),
    j=as.numeric(y), x=z, dimnames=list(levels(x), levels(y))))
mat1 <- as.matrix(matrix_EarthQuakes)
plot_ly(z = ~mat1, type = "surface")

# plot_ly(z = ~mat1[30:60, 20:40], type = "surface")

Fig. 3.32 Image representation of the kernel-density estimation of the Earthquake magnitude rendered as a heatmap


Fig. 3.33 Live demo of 3D kernel density surface plots using the Earthquake and 2D brain
imaging data (http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html)


You can explore the interactive surface plot generated by plotly in the live demo
listed in Fig. 3.33.


3.15  Cohort-Rebalancing  (for  Imbalanced  Groups) 

Comparing cohorts with imbalanced sample sizes (unbalanced designs) may introduce
hidden biases in the results. Frequently, a cohort-rebalancing protocol is necessary to
avoid such unexpected effects. Extremely unequal sample sizes can invalidate
various parametric assumptions (e.g., homogeneity of variances). Also, there may
be insufficient data representing the patterns belonging to the minority class(es),
leading to inadequate capturing of the feature distributions. Although the groups do
not have to have equal sizes, a general rule of thumb is 0.5 < n_1/n_2 < 2, where
n_1 and n_2 denote the sizes of any two groups. In particular, designs where one
group is more than an order of magnitude larger than another group have a clear
potential for bias.

Example 1 Parkinson’s Disease Study involving neuroimaging, genetics,
clinical, and phenotypic data for over 600 volunteers produced multivariate data
for three cohorts: HC = Healthy Controls (166), PD = Parkinson’s (434),
SWEDD = subjects without evidence for dopaminergic deficit (61) (Figs. 3.34 and
3.35).
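
As a quick check of how unbalanced these cohorts are, the sketch below computes all pairwise size ratios from the counts quoted in Example 1. It assumes the rule of thumb is read as a bound on the ratio of two group sizes; the object names are hypothetical.

```r
# Cohort sizes quoted in Example 1
sizes <- c(HC = 166, PD = 434, SWEDD = 61)

# ratio[i, j] = size of group i divided by size of group j
ratio <- outer(sizes, sizes, "/")
round(ratio, 2)

# TRUE where a pair of groups satisfies 0.5 < ratio < 2
balanced <- ratio > 0.5 & ratio < 2
balanced
```

None of the three off-diagonal pairs falls inside the (0.5, 2) band, for example PD/HC = 434/166 is about 2.6, which is consistent with rebalancing the cohorts before comparing them.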
Fig. 3.34 Validation that cohort rebalancing does not substantially alter the distributions of
features.  This  QQ  plot  of  one  variable  shows  the  linearity  of  the  quantiles  of  the  initial  (x)  and 
rebalanced  (y)  data 


Fig.  3.35  Scatter  plot  of  the  raw  (x)  and  corrected/adjusted  (y)  p-values  corresponding  to  the  paired 
two-sample  Wilcoxon  non-parametric  test  comparing  the  raw  and  rebalanced  features 


# update.packages()

# load the data: 06_PPMI_ClassificationValidationData_Short.csv
ppmi_data <- read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)

table(ppmi_data$ResearchGroup)

# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control",
                                  "Control", "Patient")
attach(ppmi_data)

head(ppmi_data)
# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
require(crossval)
require(ada)

# set up adaboosting prediction function
# Define a new classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula){
  ada.fit <- ada(train.x, train.y)
  predict.y <- predict(ada.fit, test.x)

  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)

  # negative is the label of a negative "null" sample (default: "control").
  return (out)
}

# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class
# misbalance in binary classification.
set.seed(1000)

# install.packages("unbalanced") to deal with unbalanced group data
require(unbalanced)

ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
uniqueID <- unique(ppmi_data$FID_IID)
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)

colnames(ppmi_data)

# ppmi_data.1 <- ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD

# remove Default Real Clinical subject classifications!
ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup",
                                                     "PD", "X", "FID_IID"))]

# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)

c(dim(input), dim(output))

# balance the dataset
data.1 <- ubBalance(X=input, Y=output, type="ubSMOTE", percOver=300,
                    percUnder=150, verbose=TRUE)

# percOver = A number that drives the decision of how many extra cases from
# the minority class are generated (known as over-sampling).
# k = A number indicating the number of nearest neighbors that are used to
# generate the new examples of the minority class.
# percUnder = A number that drives the decision of how many extra cases from
# the majority classes are selected for each case generated from the minority
# class (known as under-sampling)
balancedData <- cbind(data.1$X, data.1$Y)
table(data.1$Y)

nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)

colnames(balancedData) <- c(colnames(input), "PD")

# check visually for differences between the distributions of the raw
# (input) and rebalanced data (for only one variable, in this case)
qqplot(input[, 5], balancedData[, 5])

### Check balance
## Wilcoxon test
alpha.0.05 <- 0.05
test.results.bin <- NULL   # binarized/dichotomized p-values
test.results.raw <- NULL   # raw p-values

for (i in 1:(ncol(balancedData)-1))
{
  test.results.raw[i] <- wilcox.test(input[, i], balancedData[, i])$p.value
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  print(c("i=", i, "Wilcoxon-test=", test.results.raw[i]))
}

print(c("Wilcoxon test results: ", test.results.bin))

test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr",
                                     n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
# "fdr", "none")

plot(test.results.raw, test.results.corr)

# zeros (0) are significant independent between-group T-test differences,
# ones (1) are insignificant

# Check the Differences between the rate of significance between the raw and
# FDR-corrected p-values
test.results.bin <- ifelse(test.results.raw > alpha.0.05, 1, 0)
table(test.results.bin)

test.results.corr.bin <- ifelse(test.results.corr > alpha.0.05, 1, 0)
table(test.results.corr.bin)
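
To make the oversampling idea concrete, the following is a simplified, base-R sketch of SMOTE-style interpolation. It is not the implementation used by the unbalanced package; the function smote_sketch, its arguments, and the toy minority matrix are all hypothetical. Each synthetic case is placed at a random position on the segment joining a minority case to one of its k nearest minority neighbors.

```r
# Simplified SMOTE-style oversampling (illustration only, NOT the ubSMOTE code).
# X_min: numeric matrix of minority-class cases (at least 2 rows)
# n_new: number of synthetic rows to generate
# k:     number of nearest neighbours considered for each sampled case
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  k <- min(k, n - 1)
  D <- as.matrix(dist(X_min))              # pairwise Euclidean distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                      dimnames = list(NULL, colnames(X_min)))
  for (s in seq_len(n_new)) {
    i  <- sample.int(n, 1)                 # pick a minority case at random
    nn <- order(D[i, ])[2:(k + 1)]         # its k nearest neighbours (skip itself)
    j  <- nn[sample.int(length(nn), 1)]    # pick one of those neighbours
    u  <- runif(1)                         # random position on the segment i -> j
    synthetic[s, ] <- X_min[i, ] + u * (X_min[j, ] - X_min[i, ])
  }
  synthetic
}

set.seed(1000)
# Hypothetical minority class: 20 cases, 2 features
minority  <- matrix(rnorm(40), ncol = 2, dimnames = list(NULL, c("f1", "f2")))
new_cases <- smote_sketch(minority, n_new = 60, k = 5)
dim(new_cases)
```

Because every synthetic row is a convex combination of two observed minority cases, the new cases stay inside the range of the observed minority data, which is one reason the QQ plot of raw versus rebalanced features is expected to remain close to linear.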


3.16  Appendix 

3.16.1  Importing  Data  from  SQL  Databases 

We can also import SQL databases into R. First, we need to install and load the
RODBC (R Open Database Connectivity) package.

install.packages("RODBC", repos = "http://cran.us.r-project.org")

library(RODBC)
Then, we could open a connection to the SQL server database with Data Source
Name  (DSN),  via  Microsoft  Access.  More  details  are  provided  online. 


3.16.2  R  Code  Fragments 

Below  are  some  code  snippets  used  to  generate  some  of  the  graphs  shown  in  this 
Chapter. 

# Right Skewed
N <- 10000
x <- rnbinom(N, 10, .5)
hist(x,
     xlim=c(min(x), max(x)), probability=T, nclass=max(x)-min(x)+1,
     col='lightblue', xlab=' ', ylab=' ', axes=F,
     main='Right Skewed')
lines(density(x, bw=1), col='red', lwd=3)

# No Skew
N <- 10000
x <- rnorm(N, 0, 1)
hist(x, probability=T,
     col='lightblue', xlab=' ', ylab=' ', axes=F,
     main='No Skew')
lines(density(x, bw=0.4), col='red', lwd=3)

# Uniform density
x <- runif(1000, 1, 50)
hist(x, col='lightblue', main="Uniform Distribution", probability = T, xlab="",
     ylab="Density", axes=F)
abline(h=0.02, col='red', lwd=3)

# 68-95-99.7 rule
x <- rnorm(N, 0, 1)
hist(x, probability=T,
     col='lightblue', xlab=' ', ylab=' ', axes = F,
     main='68-95-99.7 Rule')
lines(density(x, bw=0.4), col='red', lwd=3)
axis(1, at=c(-3, -2, -1, 0, 1, 2, 3), labels = expression(mu-3*sigma,
     mu-2*sigma, mu-sigma, mu, mu+sigma, mu+2*sigma, mu+3*sigma))

abline(v=-1, lwd=3, lty=2)
abline(v=1, lwd=3, lty=2)
abline(v=-2, lwd=3, lty=2)
abline(v=2, lwd=3, lty=2)
abline(v=-3, lwd=3, lty=2)
abline(v=3, lwd=3, lty=2)

text(0, 0.2, "68%")
segments(-1, 0.2, -0.3, 0.2, col = 'red', lwd=2)
segments(1, 0.2, 0.3, 0.2, col = 'red', lwd=2)

text(0, 0.15, "95%")
segments(-2, 0.15, -0.3, 0.15, col = 'red', lwd=2)
segments(2, 0.15, 0.3, 0.15, col = 'red', lwd=2)

text(0, 0.1, "99.7%")
segments(-3, 0.1, -0.3, 0.1, col = 'red', lwd=2)
segments(3, 0.1, 0.3, 0.1, col = 'red', lwd=2)
3.17 Assignments: 3. Managing Data in R
3.17.1  Import,  Plot,  Summarize  and  Save  Data 

Load  the  following  two  datasets,  generate  summary  statistics  for  all  variables,  plot 
some  of  the  features  (e.g.,  histograms,  box  plots,  density  plots,  etc.),  and  save  the 
data  locally  as  CSV  files: 

•  ALS case-study data, https://umich.instructure.com/courses/38100/files/folder/Case_Studies/15_ALS_CaseStudy.

•  SOCR Knee Pain Data, http://wiki.socr.umich.edu/index.php/SOCR_Data_KneePainData_041409.


3.17.2  Explore  some  Bivariate  Relations  in  the  Data 

Use  ALS  case-study  data  or  SOCR  Knee  Pain  Data  to  explore  some  bivariate 
relations  (e.g.  bivariate  plot,  correlation,  table,  crosstable,  etc.) 

Use 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to
show the relations between temperature and time. [Hint: use geom_line or
geom_bar].

Some  sample  code  for  dealing  with  the  table  of  temperatures  data  is  included 
below. 


<code> 

Temp_Data <- as.data.frame(read.csv("https://umich.instructure.com/files/706163/download?download_frd=1",
    header=T, na.strings=c("", "NA", "NR")))

summary(Temp_Data)
# View(Temp_Data); colnames(Temp_Data)

# Wide-to-Long transformation: reshape arguments include
# (1) list of variable names that define the different times or metrics (varying),
# (2) the name we wish to give the variable containing these values in our long dataset (v.names),
# (3) the name we wish to give the variable describing the different times or metrics (timevar),
# (4) the values this variable will have (times), and
# (5) the end format for the data (direction)
# Before reshaping make sure all data types are the same as putting them in 1 column will
# otherwise generate inconsistencies/errors
colN <- colnames(Temp_Data[, -1])

longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar="Months",
    times = colN, direction = "long")

# View(longTempData)

bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar2)

bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar3)

p <- ggplot(longTempData, aes(x=Year, y=as.integer(Temps), colour=Months)) +
  geom_line()
p

</code> 
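
The wide-to-long step in the sample code can be tried on a toy table first. This is a minimal sketch using base R reshape(); the data frame wide and its values are hypothetical and only mimic the layout of one row per year and one column per month.

```r
# Hypothetical wide table: one row per year, one column per month
wide <- data.frame(Year = c(2013, 2014, 2015),
                   Jan  = c(20.1, 18.3, 21.0),
                   Feb  = c(22.5, 19.9, 23.4),
                   Mar  = c(35.2, 30.1, 33.8))
months <- colnames(wide)[-1]

long <- reshape(wide,
                varying   = months,     # (1) columns to stack
                v.names   = "Temps",    # (2) name of the stacked value column
                timevar   = "Months",   # (3) name of the column identifying the source column
                times     = months,     # (4) values stored in that column
                direction = "long")     # (5) target format
dim(long)
head(long)
```

Each of the three years now contributes one row per month, so the long table has 3 x 3 = 9 rows, with Year, Months and Temps columns plus an id column that reshape() adds by default.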
3.17.3 Missing Data

Introduce  (artificially)  some  missing  data,  impute  the  missing  values  and  examine 
the  differences  between  the  original,  incomplete,  and  imputed  data. 


3.17.4  Surface  Plots 

Generate  a  surface  plot  for  the  (RF)  Knee  Pain  data  illustrating  the  2D  distribution  of 
locations  of  the  patient  reported  knee  pain  (use  plot_ly  and  kernel  density 
estimation). 


3.17.5  Unbalanced  Designs 

Rebalance the groups of ALS (training data) patients according to Age >50 and
Age <50 using synthetic minority oversampling (SMOTE) to ensure approximately
equal cohort sizes. (Hint: you may need to set 1 as the minority class.)


3.17.6  Aggregate  Analysis 

Use the California Ozone Data to generate a summary report. Make sure to include:
summary for every variable, the structure of the data, convert to appropriate data
type, discuss the tendency of the ozone average concentration, explore the
differences of the ozone concentration in a specific area (you may select year 2006),
explore the seasonal change of ozone concentration.
In this chapter, we use a broad range of simulations and hands-on activities to
highlight  some  of  the  basic  data  visualization  techniques  using  R.  A  brief  discussion 
of  alternative  visualization  methods  is  followed  by  demonstrations  of  histograms, 
density,  pie,  jitter,  bar,  line  and  scatter  plots,  as  well  as  strategies  for  displaying  trees, 
more  general  graphs,  and  3D  surface  plots.  Many  of  these  are  also  used  throughout 
the  textbook  in  the  context  of  addressing  the  graphical  needs  of  specific  case-studies. 

It  is  practically  impossible  to  cover  all  options  of  every  different  visualization 
routine.  Readers  are  encouraged  to  experiment  with  each  visualization  type,  change 
input  data  and  parameters,  explore  the  function  documentation  using  R-help  (e.g., 
?plot),  and  search  online  for  new  R  visualization  packages  and  new  functionality, 
which  are  continuously  being  developed. 

We will cover (1) one specific classification of visualization methods,
(2) composition (e.g., density, histogram), comparison (e.g., jitter, bar, correlation) and
relationship (e.g., line) plots, (3) 2D kernel density and 3D surface plots, and
(4) 3D and 4D visualization of solids and (hyper)volumes.


4.1  Common  Questions 

•  What  exploratory  visualization  techniques  are  available  to  graphically  interrogate 
my  specific  data? 

•  How  do  we  examine  paired  associations  and  correlations  in  a  multivariate 
dataset? 


©  Ivo  D.  Dinov  2018 
4.2 Classification of Visualization Methods

Scientific data-driven or simulation-driven visualization methods are hard to
classify. The following list of criteria can be used to characterize alternative data
visualization strategies:

•  Data  Type:  structured/unstructured,  small/large,  complete/incomplete,  time/ 
space,  ASCII/binary,  Euclidean/non-Euclidean,  etc. 

•  Task type: the task that the display is meant to support, which determines how
the researcher interacts with the data and the display software/platform

•  Scalability:  Visualization  techniques  are  subject  to  some  limitations,  such  as  the 
amount  of  data  that  a  particular  technique  can  exhibit 

•  Dimensionality:  Visualization  techniques  can  also  be  classified  according  to  the 
number  of  attributes 

•  Positioning  and  Attributes:  the  distribution  of  attributes  on  the  chart  may  affect 
the  interpretation  of  the  display  representation,  e.g.,  correlation  analysis,  where 
the  relative  distance  among  the  plotted  attributes  is  relevant  for  observation 

•  Investigative  Need:  the  specific  scientific  question  or  exploratory  interest  may 
also  determine  the  type  of  visualization: 

-  Examining  the  composition  of  the  data 

-  Exploring  the  distribution  of  the  data 

-  Contrasting  or  comparing  several  data  elements,  relations,  association 

-  Unsupervised  exploratory  data  mining. 

Also,  we  have  the  following  table  for  common  data  visualization  methods 
according  to  task  types  (Fig.  4.1): 

We  introduce  common  data  visualization  methods  according  to  this  classification 
criterion,  albeit  this  is  not  a  unique  or  even  broadly  agreed  upon  ontological 
characterization  of  exploratory  data  visualization. 


4.3  Composition 

In  this  section,  we  will  see  composition  plots  for  different  types  of  variables  and  data 
structures. 


4.3.1  Histograms  and  Density  Plots 

Fig. 4.1 Schematic depiction of a hierarchical classification of different visualization methods

One of the first few graphs we learn is a histogram plot. In R, the command hist()
is applied to a vector of values and used for plotting histograms. The famous
nineteenth century statistician Karl Pearson introduced histograms as graphical
representations of the distribution of a sample of numeric data. The histogram plot
uses the data to infer and display the probability distribution of the underlying
population that the data is sampled from. Histograms are constructed by selecting
a certain number of bins covering the range of values of the observed process.
Typically, the number of bins for a data array of size N is chosen to be about √N. These
bins form a partition (disjoint and covering sets) of the range. Finally, we compute
the relative frequency representing the proportion of observations that fall within each
bin interval. The histogram just plots a piece-wise step-function defined over the
union of the bin intervals whose height equals the observed relative frequencies
(Fig. 4.2).
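
The construction just described can be reproduced by hand. The sketch below is a minimal illustration on simulated data, assuming equal-width bins and the square-root rule; the variable names are hypothetical.

```r
set.seed(1)
x <- rnorm(1000)                   # simulated sample of size N = 1000
N <- length(x)
n_bins <- ceiling(sqrt(N))         # square-root rule for the number of bins

# equal-width bin edges that partition the observed range
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# index of the bin containing each observation (all.inside keeps the
# maximum in the last bin)
bin_id <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)

counts   <- tabulate(bin_id, nbins = n_bins)   # observations per bin
rel_freq <- counts / N                         # relative frequencies

sum(rel_freq)    # the relative frequencies add up to 1

# the same partition drawn by hist(), in an interactive session:
# hist(x, breaks = breaks, freq = FALSE)
```

Note that hist() closes its bins on the right by default, whereas findInterval() closes them on the left, so individual counts can differ for observations that fall exactly on a bin edge.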
Fig. 4.2 Overlay of Normal distribution histogram and density curve plot (panel title: Histogram of x; horizontal axis: x)

Fig. 4.3 Overlay of histogram plot and a density curve estimate (panel title: Histogram of x; horizontal axis: x)


set.seed(1)
x <- rnorm(1000)

hist(x, freq=T, breaks = 10)
lines(density(x), lwd=2, col="blue")
t <- seq(-3, 3, by=0.01)
lines(t, 550*dnorm(t,0,1), col="magenta")  # add the theoretical density line

Here, freq=T plots the frequency (count) of x values in each bin, and breaks
suggests the number of bars (bins) in our histogram.
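
A density curve has to be rescaled before it can be drawn on a frequency histogram, which is why dnorm() is multiplied by a constant in the listing above. The sketch below shows where such a factor comes from: counts equal density times N times the bin width. It is a hedged illustration; scale_factor is a hypothetical name, and its value depends on the breaks that hist() actually selects.

```r
set.seed(1)
x <- rnorm(1000)
h <- hist(x, breaks = 10, plot = FALSE)   # same binning, without drawing

bin_width <- diff(h$breaks)               # equal widths for these breaks

# counts and densities differ by the factor N * bin width
all.equal(h$counts, h$density * length(x) * bin_width)
scale_factor <- length(x) * bin_width[1]

# overlaying the theoretical N(0, 1) curve on a frequency histogram:
# hist(x, freq = TRUE, breaks = 10)
# curve(scale_factor * dnorm(x), add = TRUE, col = "magenta")
```

Computing the factor from the histogram object avoids tuning the constant by eye whenever the sample size or the number of breaks changes.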

The  shape  of  the  last  histogram  we  drew  is  very  close  to  a  Normal  distribution 
(because  we  sampled  from  this  distribution  by  rnorm).  We  can  add  a  density  line  to 
the  histogram  (Fig.  4.3). 


hist(x, freq=F, breaks = 10)
lines(density(x), lwd=2, col="blue")
Fig. 4.4 Direct plot of the estimated Normal distribution density curve (panel title: density.default(x = x); horizontal axis: N = 1000, Bandwidth = 0.2338; vertical axis: Density)



We used the option freq=F to make the y axis represent the “relative
frequency”, or “density”. We can also use plot(density(x)) to draw the density
plot by itself (Fig. 4.4).


plot(density(x))


4.3.2  Pie  Chart 

We  are  all  very  familiar  with  pie  charts  that  show  us  the  components  of  a  big  “cake”. 
Although  pie  charts  provide  effective  simple  visualization  in  certain  situations,  it 
may  also  be  difficult  to  compare  segments  within  a  pie  chart  or  across  different  pie 
charts.  Other  plots  like  bar  chart,  box  or  dot  plots  may  be  attractive  alternatives. 

We  will  use  the  Letter  Frequency  Data  on  SOCR  website  to  illustrate  the  use  of 
pie  charts. 

library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_LetterFrequencyData")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top
## ...

letter <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(letter)
##     Letter             English            French            German
##  Length:27          Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  Class :character   1st Qu.:0.01000   1st Qu.:0.01000   1st Qu.:0.01000
##  Mode  :character   Median :0.02000   Median :0.03000   Median :0.03000
##                     Mean   :0.03667   Mean   :0.03704   Mean   :0.03741
##                     3rd Qu.:0.06000   3rd Qu.:0.06500   3rd Qu.:0.05500
##                     Max.   :0.13000   Max.   :0.15000   Max.   :0.17000
##     Spanish           Portuguese         Esperanto          Italian
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.00500   1st Qu.:0.01000   1st Qu.:0.00500
##  Median :0.03000   Median :0.03000   Median :0.03000   Median :0.03000
##  Mean   :0.03815   Mean   :0.03778   Mean   :0.03704   Mean   :0.03815
##  3rd Qu.:0.06000   3rd Qu.:0.05000   3rd Qu.:0.06000   3rd Qu.:0.06000
##  Max.   :0.14000   Max.   :0.15000   Max.   :0.12000   Max.   :0.12000
##     Turkish           Swedish            Polish           Toki_Pona
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.01000   1st Qu.:0.01500   1st Qu.:0.00000
##  Median :0.03000   Median :0.03000   Median :0.03000   Median :0.03000
##  Mean   :0.03667   Mean   :0.03704   Mean   :0.03704   Mean   :0.03704
##  3rd Qu.:0.05500   3rd Qu.:0.05500   3rd Qu.:0.04500   3rd Qu.:0.05000
##  Max.   :0.12000   Max.   :0.10000   Max.   :0.20000   Max.   :0.17000
##      Dutch            Avgerage
##  Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.01000
##  Median :0.02000   Median :0.03000
##  Mean   :0.03704   Mean   :0.03741
##  3rd Qu.:0.06000   3rd Qu.:0.06000
##  Max.   :0.19000   Max.   :0.12000


We  can  try  to  plot  the  frequency  for  the  first  ten  English  letters.  The  left  hand  side 
shows  a  table  made  by  the  function  legend  (Fig.  4.5). 

par(mfrow=c(1, 2))
pie(letter$English[1:10], labels=letter$Letter[1:10], col=rainbow(10, start=0.1, end=0.8),
    clockwise=TRUE, main="First 10 Letters Pie Chart")
pie(letter$English[1:10], labels=letter$Letter[1:10], col=rainbow(10, start=0.1, end=0.8),
    clockwise=TRUE, main="First 10 Letters Pie Chart")
legend("topleft", legend=letter$Letter[1:10], cex=1.3, bty="n", pch=15, pt.cex=1.8,
    col=rainbow(10, start=0.1, end=0.8), ncol=1)


Fig.  4.5  Pie  chart  showing 
the  frequency  of  English 
use  of  the  first  ten  Latin 
letters  (a-j) 
The input type for pie() is a vector of non-negative numerical quantities. In the
pie function, we list the data that we are going to use (non-negative and numeric), the labels
for each of them, and the colors we want to use for each sector. In the legend function,
we put the location in the first slot and use legend as the labels for the colors. Also,
cex, bty, pch, and pt.cex are all graphic parameters that we have talked about in
Chap. 2.

More  elaborate  pie  charts,  using  the  Latin  letter  data,  will  be  demonstrated  using 
ggplot  below  (in  the  Appendix  of  this  chapter). 
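
As a small illustration of the input that pie() expects, the sketch below uses a hypothetical vector of letter frequencies (illustrative values, not the SOCR data) and draws the same quantities as a pie chart and as a bar chart. The names freq, share, and angles are assumptions made for this example.

```r
# Hypothetical letter frequencies (illustrative values, not the SOCR data)
freq <- c(a = 0.08, b = 0.02, c = 0.03, d = 0.04, e = 0.13)

# pie() only uses relative sizes, so the input is effectively normalized
share <- freq / sum(freq)
angles <- 360 * share            # central angle of each sector, in degrees

par(mfrow = c(1, 2))
pie(freq, labels = names(freq), col = rainbow(length(freq)), main = "Pie chart")
barplot(freq, col = rainbow(length(freq)), main = "Bar chart")
```

Placing the two displays side by side makes it easier to judge when bar heights are simpler to compare than sector angles.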


4.3.3  Heat  Map 

Another common data visualization method is the heat map. Heat maps help us
intuitively visualize the individual values in a matrix. They are widely used in genetics
research and financial applications.

We will illustrate the use of heat maps based on neuroimaging genetics case-study
data about the association (p-values) of different brain regions of interest
(ROIs) and genetic traits (SNPs) for Alzheimer’s disease (AD) patients, subjects
with mild cognitive impairment (MCI), and normal controls (NC). First, let’s import
the data into R. The data are 2D arrays where the rows represent different genetic
SNPs, the columns represent brain ROIs, and the cell values represent the strength of the
SNP-ROI association as a probability value (smaller p-values indicate stronger
neuroimaging-genetic associations).


AD_Data <- read.table("https://umich.instructure.com/files/330387/download?download_frd=1",
    header=TRUE, row.names=1, sep=",", dec=".")

MCI_Data <- read.table("https://umich.instructure.com/files/330390/download?download_frd=1",
    header=TRUE, row.names=1, sep=",", dec=".")

NC_Data <- read.table("https://umich.instructure.com/files/330391/download?download_frd=1",
    header=TRUE, row.names=1, sep=",", dec=".")

Then, we load the R packages we need for heat maps (use
install.packages("package name") first if you have not previously installed them).


require(graphics)
require(grDevices)
library(gplots)

We  can  convert  the  datasets  into  matrices. 


AD_mat <- as.matrix(AD_Data); class(AD_mat) <- "numeric"
MCI_mat <- as.matrix(MCI_Data); class(MCI_mat) <- "numeric"
NC_mat <- as.matrix(NC_Data); class(NC_mat) <- "numeric"


We may also want to set up the row (rc) and column (cc) colors for each cohort.

Fig. 4.6 Hierarchically clustered heatmap for the Alzheimer’s disease (AD) cohort of the dementia
study. The rows indicate the unique SNP reference sequence (rs) IDs and the columns index specific
brain regions of interest (ROIs) that are associated with the genomic biomarkers (rows)


rcAD <- rainbow(nrow(AD_mat), start=0, end=1.0); ccAD <- rainbow(ncol(AD_mat), start=0, end=1.0)

rcMCI <- rainbow(nrow(MCI_mat), start=0, end=1.0); ccMCI <- rainbow(ncol(MCI_mat), start=0, end=1.0)

rcNC <- rainbow(nrow(NC_mat), start=0, end=1.0); ccNC <- rainbow(ncol(NC_mat), start=0, end=1.0)

Finally, we can plot the heat maps by specifying the input type of heatmap() to
be a numeric matrix (Figs. 4.6, 4.7, and 4.8).

hvAD <- heatmap(AD_mat, col=cm.colors(256), scale="column", RowSideColors=rcAD,
    ColSideColors=ccAD, margins=c(2, 2), main="AD Cohort")

hvMCI <- heatmap(MCI_mat, col=cm.colors(256), scale="column", RowSideColors=rcMCI,
    ColSideColors=ccMCI, margins=c(2, 2), main="MCI Cohort")

Fig. 4.7 Hierarchically clustered heatmap for the Mild Cognitive Impairment (MCI) cohort


hvNC <- heatmap(NC_mat, col=cm.colors(256), scale="column", RowSideColors=rcNC,
    ColSideColors=ccNC, margins=c(2, 2), main="NC Cohort")

In the heatmap() function, the first argument provides the input matrix we
want to use. col is the color scheme; scale is a character string indicating if the values
should be centered and scaled in either the row direction or the column direction, or
not at all ("row", "column", and "none"); RowSideColors and ColSideColors
provide the color names for the vertical and horizontal side bars that annotate the
rows and columns, respectively.
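
To make the role of the scale argument more concrete, here is a minimal sketch on a small simulated matrix. The matrix m and its row and column names are hypothetical stand-ins for a SNP-by-ROI table, not the case-study data.

```r
set.seed(10)
# Hypothetical 6 x 4 matrix standing in for a SNP-by-ROI table
m <- matrix(runif(24), nrow = 6,
            dimnames = list(paste0("snp", 1:6), paste0("roi", 1:4)))

# Column-wise centering and scaling, the transformation implied by scale = "column"
m_scaled <- scale(m)             # scale() standardizes columns by default

hv <- heatmap(m, col = cm.colors(256), scale = "column",
              RowSideColors = rainbow(nrow(m)),
              ColSideColors = rainbow(ncol(m)),
              margins = c(4, 4), main = "Toy example")

# heatmap() invisibly returns the row and column orderings of the dendrograms
hv$rowInd
hv$colInd
```

After column scaling, every column has mean 0 and standard deviation 1, so the colors compare values within a column rather than across columns.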

The  differences  between  the  AD,  MCI,  and  NC  heat  maps  are  suggestive  of 
variations  of  genetic  traits  or  alternative  brain  regions  that  may  be  affected  in  the 
three  clinically  different  cohorts. 

Fig. 4.8 Hierarchically clustered heatmap for the healthy normal controls (NC) cohort


4.4  Comparison 


Plots  used  for  comparing  different  individuals,  groups  of  subjects,  or  multiple  units 
represent  another  set  of  popular  exploratory  visualization  tools. 


4.4.1  Paired  Scatter  Plots 

Scatter plots use the 2D Cartesian plane to display a pair of variables. 2D points
represent the values of the two variables corresponding to the two coordinate axes.
The position of each 2D point is determined by the values of the first and second
variables, which are plotted on the horizontal and vertical axes, respectively. If no clear dependent
variable exists, either variable can be plotted on the X axis and the corresponding
scatter plot will illustrate the degree of correlation (not necessarily causation)
between the two variables.

Fig. 4.9 Scatter plot of bivariate uniform process

Basic scatter plots can be plotted by the function plot(x, y) (Fig. 4.9).

x <- runif(50)
y <- runif(50)
plot(x, y, main="Scatter Plot")

qplot() is another way to display elaborate scatter plots. We can manage the
colors and sizes of dots. The input type for qplot() is a data frame. In the
following example, larger x values will have larger dot sizes. We also grouped the data into
five groups of ten points each (Fig. 4.10).

library(ggplot2)
cat <- rep(c("A", "B", "C", "D", "E"), 10)
plot.1 <- qplot(x, y, geom="point", size=5*x, color=cat,
    main="GGplot with Relative Dot Size and Color")
print(plot.1)

Now, let’s draw a paired scatter plot with three variables. The input type for the
pairs() function is a matrix or data frame (Fig. 4.11).

z <- runif(50)
pairs(data.frame(x, y, z))

We  can  see  that  variable  names  are  on  the  diagonal  of  this  scatter  plot  matrix. 
Each  plot  uses  the  column  variable  as  its  X-axis  and  row  variable  as  its  Y-axis. 

Let’s see a real-world data example. First, we can import the Mental Health
Services Survey Data, which is on the case-studies website, into R.

Fig. 4.10 Simulated bubble plot depicting four variable features represented as x and y axes, size
and color

Fig. 4.11 A pairs plot depicts the bivariate relations in multivariate datasets

data1 <- read.table('https://umich.instructure.com/files/399128/download?download_frd=1', header=T)
head(data1)

##          STFIPS majorfundtype FacilityType Ownership Focus PostTraum GLBT
## 1     southeast             1            5         2     1         0    0
## 2     southeast             3            5         3     1         0    0
## 3     southeast             1            6         2     1         1    1
## 4    greatlakes            NA            2         2     1         0    0
## 5 rockymountain             1            5         2     3         0    0
## 6       mideast            NA            2         2     1         0    0
##   num qual supp
## 1   5   NA   NA
## 2   4   15    4
## 3   9   15   NA
## 4   7   14    6
## 5   9   18   NA
## 6   8   14   NA

attach(data1)

From the head() output, we observe that there are a lot of NAs in the dataset.
pairs() automatically deals with this problem (Figs. 4.12 and 4.13).

plot(data1[, 9], data1[, 10], pch=20, col="red", main="qual vs supp")

pairs(data1[, 5:10])
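
The handling of missing values can be explored on a small example before plotting. The sketch below uses a hypothetical data frame df with a few NA entries (the values are made up for illustration) and shows one way to count complete rows and to compute correlations from the available pairs.

```r
# Hypothetical survey-like data with missing values
df <- data.frame(qual = c(15, 15, 14, 18, 14, NA),
                 supp = c(4, NA, 6, NA, 5, 3),
                 num  = c(4, 9, 7, 9, 8, 5))

colSums(is.na(df))                       # number of NAs per column
n_complete <- sum(complete.cases(df))    # rows without any NA

# Correlations using, for each pair, only rows where both variables are observed
r <- cor(df, use = "pairwise.complete.obs")

pairs(df)    # points with a missing coordinate are simply not drawn
```

Each panel of the pairs plot may therefore be based on a different number of observations, which is worth keeping in mind when comparing panels.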

Figure 4.12 represents just one of the plots shown in the collage in Fig. 4.13. We
can see that Focus and PostTraum have no clear relationship: Focus can equal
3 or 1 for either of the PostTraum values (0 or 1). On the other hand, larger supp
tends to correspond to larger qual values.

To see this trend, we can also make a plot using the qplot function. This allows us to
add a smooth model curve forecasting a possible trend (Fig. 4.14).


Fig. 4.12 Each of the bivariate plots in a pairs plot collage may be zoomed up and explored further

Fig. 4.13 A more elaborate 6D pairs plot showing the type and scale of each variable and their
bivariate relations

Fig. 4.14 Plotting the bivariate trend along with its confidence limits

plot.2 <- qplot(qual, supp, data=data1, geom=c("point", "smooth"))
print(plot.2)

You  can  also  use  the  human  height  and  weight  dataset  or  the  knee  pain  dataset  to 
illustrate  some  interesting  scatter  plots. 


4.4.2  Jitter  Plot 

Jitter plots can help us deal with overplotting when we have many points in
the data. The function we will be using, position_jitter(), is in the ggplot2
package.

Let’s use the earthquake data for this example. We will compare the plots
with and without the position_jitter() function (Figs. 4.15 and 4.16).


Fig. 4.15 Jitter plot of magnitude type against depth and latitude (Earthquake dataset); x-axis: Depth, y-axis: Latitude, color legend: Magt (Md, ML, Mw, Mx)

Fig. 4.16 A lower opacity jitter plot of magnitude type against depth and latitude


# library("xml2"); library("rvest")
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021708_Earthquakes")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top

earthquake <- html_table(html_nodes(wiki_url, "table")[[2]])

plot6.1 <- ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt)) +
    geom_point()
plot6.2 <- ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt)) +
    geom_point(position=position_jitter(w=0.3, h=0.3), alpha=0.5)
print(plot6.1)

print(plot6.2)
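
The idea behind jittering can also be sketched in base R, without ggplot2. The example below uses hypothetical rounded coordinates (the vectors depth and latitude are simulated, not the earthquake data) and adds uniform noise in [-0.3, 0.3] to each axis, which is intended to mimic the effect of position_jitter(w = 0.3, h = 0.3).

```r
set.seed(42)
# Hypothetical rounded measurements: many observations share the same coordinates
depth    <- sample(c(5, 10, 15), 200, replace = TRUE)
latitude <- sample(c(34, 36, 38), 200, replace = TRUE)

n_distinct_raw <- nrow(unique(data.frame(depth, latitude)))

# Add uniform noise in [-0.3, 0.3] to both coordinates
depth_j    <- depth    + runif(200, -0.3, 0.3)
latitude_j <- latitude + runif(200, -0.3, 0.3)
n_distinct_jit <- nrow(unique(data.frame(depth_j, latitude_j)))

par(mfrow = c(1, 2))
plot(depth, latitude, pch = 19, main = "Raw (overplotted)")
plot(depth_j, latitude_j, pch = 19, col = rgb(0, 0, 1, alpha = 0.5),
     main = "Jittered, alpha = 0.5")
```

In the raw plot, the 200 observations collapse onto at most nine distinct locations, whereas after jittering every observation is drawn at its own position.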

Note that with option alpha=0.5 the “crowded” places are darker than the
places with only one data point. Sometimes, we need to add text to these points, i.e.,
add label in aes or add geom_text. The result may look messy (Fig. 4.17).


ggplot(earthquake, aes(Depth, Latitude, group=Magt,
    color=Magt, label=rownames(earthquake))) +
    geom_point(position=position_jitter(w=0.3, h=0.3), alpha=0.5) +
    geom_text()


Fig. 4.17 Another version of the jitter plot of magnitude type explicitly listing the Earthquake ID
label


Let’s  try  to  fix  the  overlap  of  points  and  labels.  We  need  to  add 
check_overlap  in  geom_text  and  adjust  the  positions  of  the  text  labels  with 
respect  to  the  points  (Figs.  4.18  and  4.19). 

ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt,
    label=rownames(earthquake))) +
    geom_point(position=position_jitter(w=0.3, h=0.3), alpha=0.5) +
    geom_text(check_overlap=T, vjust=0, nudge_y=0.5, size=2, angle=45)

# Or you can simply use the text to denote the positions of points.
ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt,
    label=rownames(earthquake))) +
    geom_text(check_overlap=T, vjust=0, nudge_y=0, size=3, angle=45)


4.4.3  Bar  Plots 

Bar  plots,  or  bar  charts,  represent  group  data  with  rectangular  bars.  There  are  many 
variants  of  bar  charts  for  comparison  among  categories.  Typically,  either  horizontal 
or  vertical  bars  are  used  where  one  of  the  axes  shows  the  compared  categories  and 
the  other  axis  represents  a  discrete  value.  It’s  possible,  and  sometimes  desirable,  to 
plot  bar  graphs  including  bars  clustered  by  groups. 

Fig. 4.18 Yet another version of the previous jitter plot illustrating label specifications

Fig. 4.19 This jitter plot suppresses the scatter point bubbles in favor of ID labels

Fig. 4.20 Example of a labeled barplot using simulated data with grouping categorical labels


In R, we have the barplot() function to generate bar plots. The input for
barplot() is either a vector or a matrix (Fig. 4.20).


x <- matrix(runif(50), ncol=5, dimnames=list(letters[1:10], LETTERS[1:5]))
x

##            A          B         C           D         E
## a 0.64397479 0.75069788 0.4859278 0.068299279 0.5069665
## b 0.21981304 0.84028392 0.7489431 0.130542241 0.2694441
## c 0.08903728 0.87540556 0.2656034 0.146773063 0.6346498
## d 0.13075121 0.01106876 0.7586781 0.860316695 0.9976566
## e 0.87938851 0.04156918 0.1960069 0.949276015 0.5050743
## f 0.65204025 0.21135891 0.3774320 0.896443296 0.9332330
## g 0.02814806 0.72618285 0.5603189 0.113651731 0.1912089
## h 0.13106307 0.79411904 0.4526415 0.793385952 0.4847625
## i 0.15759514 0.63369297 0.8861631 0.004317772 0.6341256
## j 0.47347613 0.14976052 0.5887866 0.698139910 0.2023031


barplot(x[1:4, ], ylim=c(0, max(x[1:4, ])+0.3), beside=TRUE, legend.text=letters[1:4],
    args.legend=list(x="topleft"))

text(labels=round(as.vector(as.matrix(x[1:4, ])), 2),
    x=seq(1.5, 21, by=1) + rep(c(0, 1, 2, 3, 4), each=4),
    y=as.vector(as.matrix(x[1:4, ]))+0.1)


It may require some creativity to add value labels on each bar. First, let’s specify
the location on the x-axis: x=seq(1.5, 21, by=1) + rep(c(0, 1, 2, 3, 4),
each=4). In this example there are 20 bars. The x location for the middle of the first
bar is 1.5 (there is one empty space before the first bar). The middle of the last bar is
24.5. seq(1.5, 21, by=1) starts at 1.5 and generates 20 positions, the last of which is x=20.5.
Then, we use rep(c(0, 1, 2, 3, 4), each=4) to add 0 to the first group, 1 to
the second group, and so forth, which accounts for the empty space between adjacent groups.
Thus, we have the desired positions on the x-axis.
The y-axis positions are obtained just by adding 0.1 to each bar height.

Fig. 4.21 Statistical barplot showing point-estimates and their error limits (simulated data)
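
The manual positions described above can be cross-checked against the bar midpoints that barplot() itself returns. The following minimal sketch regenerates a simulated matrix (with an arbitrary seed, so the values differ from those printed earlier); the names vals, manual, and mids are introduced only for this illustration.

```r
set.seed(2)
x <- matrix(runif(50), ncol = 5, dimnames = list(letters[1:10], LETTERS[1:5]))
vals <- x[1:4, ]

# Manual bar midpoints: 20 unit-width bars plus one unit gap before each group
manual <- seq(1.5, 21, by = 1) + rep(c(0, 1, 2, 3, 4), each = 4)

# barplot() invisibly returns the midpoints it actually used
mids <- barplot(vals, ylim = c(0, max(vals) + 0.3), beside = TRUE,
                legend.text = letters[1:4], args.legend = list(x = "topleft"))

# Place the rounded values just above each bar
text(x = as.vector(mids), y = as.vector(vals) + 0.1,
     labels = round(as.vector(vals), 2))
```

Using the returned midpoints avoids recomputing the positions by hand whenever the number of groups or bars changes.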

We can also add standard deviations to the means on the bars. To do this, we need
to use the arrows() function with the option angle = 90. The result is shown in
Fig. 4.21.

bar <- barplot(m <- rowMeans(x) * 10, ylim=c(0, 10))
stdev <- sd(t(x[1:4, ]))
arrows(bar, m, bar, m + stdev, length=0.15, angle=90)
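
Note that sd() applied to a matrix returns a single number, so the code above draws error bars of the same length on every bar. If a separate dispersion estimate per bar is preferred, one possible variant (a sketch on simulated data, with s as an illustrative name) computes row-wise standard deviations using apply().

```r
set.seed(3)
x <- matrix(runif(50), ncol = 5, dimnames = list(letters[1:10], LETTERS[1:5]))

m <- rowMeans(x) * 10            # bar heights, one per row of x
s <- apply(x, 1, sd) * 10        # row-wise standard deviations on the same scale

bar <- barplot(m, ylim = c(0, max(m + s) + 1))
# Vertical segments from each mean up to mean + sd, with flat caps
arrows(bar, m, bar, m + s, length = 0.15, angle = 90)
```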

Let’s  look  at  a  more  complex  example.  We  will  utilize  the  dataset 
Case_04_ChildTrauma  for  illustration.  This  case  study  examines  associations 
between  post-traumatic  psychopathology  and  service  utilization  by  trauma-exposed 
children. 


data2 <- read.table('https://umich.instructure.com/files/399129/download?download_frd=1', header=T)
attach(data2)
head(data2)

##   id sex age ses  race traumatype ptsd dissoc service
## 1  1   1   6   0 black   sexabuse    1      1      17
## 2  2   1  14   0 black   sexabuse    0      0      12
## 3  3   0   6   0 black   sexabuse    0      1       9
## 4  4   0  11   0 black   sexabuse    0      1      11
## 5  5   1   7   0 black   sexabuse    1      1      15
## 6  6   0   9   0 black   sexabuse    1      0       6


We have two character variables. Our goal is to draw a bar plot comparing the
means of age and service among different races in this study, and we want to add
the standard deviation to each bar. The first thing to do is to delete the two character
columns. Remember, the input for barplot() is a numerical vector or a matrix.
However, we will need race information for the categorical classification. Thus, we
will store race in a different variable.

data2.sub <- data2[, c(-5, -6)]
data2 <- data2[, -6]

We  are  now  ready  to  separate  the  groups  and  compute  the  group  means. 

data2.matrix <- as.data.frame(data2)
Blacks <- data2[which(data2$race=="black"), ]
Other <- data2[which(data2$race=="other"), ]
Hispanic <- data2[which(data2$race=="hispanic"), ]
White <- data2[which(data2$race=="white"), ]

B <- c(mean(Blacks$age), mean(Blacks$service))
O <- c(mean(Other$age), mean(Other$service))
H <- c(mean(Hispanic$age), mean(Hispanic$service))
W <- c(mean(White$age), mean(White$service))

x <- cbind(B, O, H, W)
x

##          B     O    H        W
## [1,] 9.165  9.12 8.67 8.950000
## [2,] 9.930 10.32 9.61 9.911667
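
The same kind of group summary can be obtained without creating one subset per group. The sketch below uses a hypothetical data frame toy that only mimics the structure of the trauma dataset (the values are invented); aggregate() computes the group means in one call, and the result is rearranged into the variables-by-groups matrix that barplot() expects.

```r
# Hypothetical records mimicking the structure of the trauma dataset
toy <- data.frame(
  race    = c("black", "black", "white", "white", "hispanic", "other"),
  age     = c(6, 14, 9, 11, 7, 10),
  service = c(17, 12, 6, 9, 15, 8)
)

# Mean of age and service within each race, in one call
grp_means <- aggregate(cbind(age, service) ~ race, data = toy, FUN = mean)

# Rearrange into a variables-by-groups matrix
x_toy <- t(as.matrix(grp_means[, c("age", "service")]))
colnames(x_toy) <- grp_means$race

barplot(x_toy, beside = TRUE, legend.text = rownames(x_toy))
```

This approach scales to any number of groups, since the grouping levels are taken from the data rather than typed in by hand.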

Now that we have a numerical matrix for the means, we can compute a second-order
statistic, the standard deviation, and plot it along with the means to illustrate the
amount of dispersion for each variable (Fig. 4.22).

bar <- barplot(x, ylim=c(0, max(x)+2.0), beside=TRUE,
    legend.text=c("age", "service"), args.legend=list(x="right"))
text(labels=round(as.vector(as.matrix(x)), 2),
    x=seq(1.4, 21, by=1.5),  # y=as.vector(as.matrix(x[1:2, ]))+0.3)
    y=11.5)
m <- x; stdev <- sd(t(x))
arrows(bar, m, bar, m + stdev, length=0.15, angle=90)


Fig.  4.22  Barplot  showing 
point-estimates  and  their 
error  limits  (Child  Trauma 
dataset) 
Here, we want the y margin to be a little higher than the greatest value
(ylim=c(0, max(x)+2.0)) because we need to leave space for value labels. The plot
shows that Hispanic trauma-exposed children may be younger, in terms of average
age, and less likely to utilize services like primary care, emergency room, outpatient
therapy, outpatient psychiatrist, etc.

Another way to plot bar plots is to use ggplot() in the ggplot2 package. This
kind of bar plot is quite different from the one we introduced previously. It displays
the counts of character variables rather than the means of numerical variables. It
takes the values from a data.frame. Unlike barplot(), drawing bar plots
using ggplot2 requires that the character variables remain in the original data
frame (Fig. 4.23).


library(ggplot2)

data2 <- read.table('https://umich.instructure.com/files/399129/download?download_frd=1', header=T)
bar1 <- ggplot(data2, aes(race, fill=race)) + geom_bar() +
    facet_grid(. ~ traumatype)
print(bar1)

This  plot  helps  us  compare  the  occurrence  of  different  types  of  child-trauma 
among  different  races. 


4.4.4  Trees  and  Graphs 


In general, a graph is an ordered pair G = (V, E) of vertices (V), i.e., nodes or points,
and edges (E), i.e., arcs or lines connecting pairs of nodes in V. A tree is a special type of
graph that is connected and acyclic, i.e., it does not include looping paths. Visualization of graphs is critical
in many biosocial and health studies, and we will see many such examples throughout
this textbook.
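
As a small illustration of these definitions, the sketch below stores a hypothetical undirected graph as a vertex vector V and a two-column edge matrix E, and checks whether it is a tree. The helper functions is_connected() and is_tree() are written for this example only; they rely on the standard fact that a connected graph is a tree exactly when it has |V| - 1 edges.

```r
# Hypothetical undirected graph given as an edge list
V <- c("a", "b", "c", "d", "e")
E <- rbind(c("a", "b"), c("a", "c"), c("c", "d"), c("c", "e"))

# Simple connectivity check: grow the set of nodes reachable from V[1]
is_connected <- function(V, E) {
  reached <- V[1]
  repeat {
    touching <- E[, 1] %in% reached | E[, 2] %in% reached
    grown <- union(reached, as.vector(E[touching, , drop = FALSE]))
    if (length(grown) == length(reached)) break
    reached <- grown
  }
  setequal(reached, V)
}

# A connected graph is a tree exactly when it has |V| - 1 edges
is_tree <- function(V, E) is_connected(V, E) && nrow(E) == length(V) - 1

is_tree(V, E)                          # this edge list forms a tree
is_tree(V, rbind(E, c("d", "e")))      # the extra edge closes a cycle
```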

In  Chaps.  10  and  13,  we  will  learn  more  about  how  to  build  tree  models  and  other 
clustering  methods,  and  in  Chap.  23,  we  will  discuss  deep  learning  and  neural 
networks,  which  have  a  direct  graphical  representation. 

This section will be focused on displaying tree graphs. We will use a self-efficacy
study, 02_Nof1_Data.csv, for this demonstration.


data3 <- read.table("https://umich.instructure.com/files/330385/download?download_frd=1",
    sep=",", header=TRUE)
head(data3)

##   ID Day Tx SelfEff SelfEff25  WPSS SocSuppt PMss PMss3 PhyAct
## 1  1   1  1      33         8  0.97     5.00 4.03  1.03     53
## 2  1   2  1      33         8 -0.17     3.87 4.03  1.03     73
## 3  1   3  0      33         8  0.81     4.84 4.03  1.03     23
## 4  1   4  0      33         8 -0.41     3.62 4.03  1.03     36
## 5  1   5  1      33         8  0.59     4.62 4.03  1.03     21
## 6  1   6  1      33         8 -1.16     2.87 4.03  1.03      0

Fig. 4.24 Hierarchical clustering dendrogram of the 900 self-efficacy records of 30 participants
including the nine features tracked over a month


We will use hclust to build the hierarchical cluster model. hclust takes only
inputs that have a dissimilarity structure, as produced by dist(). Also, we use the
ave (average linkage) method for agglomeration; see the tree graph in Fig. 4.24.

hc <- hclust(dist(data3), method='ave')
par(mfrow=c(1, 1))
plot(hc)

When we specify no limit for the maximum number of cluster groups, we get the graph
in Fig. 4.24, which is not easy to interpret. Luckily, cutree will help us limit the
number of clusters. cutree() takes an hclust object and returns a vector of group
indicators for all observations.


require(graphics)
mem <- cutree(hc, k=10)

# mem;  # to print the hierarchical tree labels for each case
# which(mem==5)  # to identify which cases belong to class/cluster 5
# To see the number of subjects in each cluster:
# table(cutree(hc, k=5))
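
The interplay of dist(), hclust(), and cutree() is easier to see on data with an obvious grouping. The following minimal sketch simulates two well-separated groups (the object names toy, hc_toy, and mem_toy are hypothetical) and cuts the resulting tree into two clusters.

```r
set.seed(5)
# Hypothetical data: two well-separated groups of 10 observations in 2D
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

hc_toy <- hclust(dist(toy), method = "average")   # "ave" abbreviates "average"
mem_toy <- cutree(hc_toy, k = 2)                  # one cluster label per row

table(mem_toy)                                    # cluster sizes
plot(hc_toy, main = "Toy dendrogram")
rect.hclust(hc_toy, k = 2, border = "red")        # outline the two clusters
```

Here, rect.hclust() draws boxes around the branches that correspond to the clusters returned by cutree(), which can make a crowded dendrogram easier to read.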

Using a for loop, we can get the mean of each variable within groups.

cent <- NULL
for (k in 1:10) {
    cent <- rbind(cent, colMeans(data3[mem == k, , drop=FALSE]))
}
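
An equivalent way to obtain the cluster centroids, sketched below on simulated data, is to split the rows by cluster label and average each piece. The objects toy, mem_toy, cent_loop, and cent_split are hypothetical names used only in this example.

```r
set.seed(6)
toy <- data.frame(u = rnorm(30), v = rnorm(30), w = rnorm(30))
mem_toy <- cutree(hclust(dist(toy), method = "average"), k = 3)

# Loop version, following the pattern used in the text
cent_loop <- NULL
for (k in 1:3) {
  cent_loop <- rbind(cent_loop, colMeans(toy[mem_toy == k, , drop = FALSE]))
}

# Alternative: split rows by cluster, average each piece, stack the results
cent_split <- do.call(rbind, lapply(split(toy, mem_toy), colMeans))
```

Both versions yield one row of column means per cluster; the second also keeps the cluster labels as row names.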


Now, we can plot a new tree graph with ten groups. Using the
members=table(mem) option, the matrix is taken to be a dissimilarity matrix
between clusters, instead of dissimilarities between singletons, and members represents
the number of observations per cluster (Fig. 4.25).

hc1 <- hclust(dist(cent), method="ave", members=table(mem))
plot(hc1, hang=-1, main="Re-start from 10 clusters")


Fig. 4.25 A ten-cluster hierarchical dendrogram of the same dataset as before


4.4.5 Correlation Plots


The corrplot package enables the graphical display of correlation matrices,
confidence intervals, and other plots showing matrix reordering. There are seven
visualization methods (parameter method) in the corrplot package, named "circle",
"square", "ellipse", "number", "shade", "color", and "pie".

Let’s use 03_NC_SNP_ROI_Assoc_P_values.csv again to investigate the associations
among SNPs using correlation plots.

The corrplot() function we will be using only accepts correlation matrices.
So, we first need to obtain the correlation matrix of our data using the cor()
function.
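
Before working with the real data, it may help to see what cor() returns on a small simulated example. In the sketch below, the data frame toy and its columns v1, v2, and v3 are hypothetical; the plotting call is only attempted if the corrplot package happens to be installed.

```r
set.seed(7)
# Hypothetical data frame: v2 is built to track v1, v3 is independent noise
toy <- data.frame(v1 = rnorm(100))
toy$v2 <- toy$v1 + rnorm(100, sd = 0.1)
toy$v3 <- rnorm(100)

M_toy <- cor(toy)                # pairwise Pearson correlations of the columns
round(M_toy, 2)

# Visualize only if the corrplot package happens to be installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(M_toy, method = "circle")
}
```

The resulting matrix is symmetric with ones on the diagonal, which is the form of input that correlation plots are designed for.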


# install.packages("corrplot")
library(corrplot)

NC_Associations_Data <- read.table("https://umich.instructure.com/files/330391/download?download_frd=1",
    header=TRUE, row.names=1, sep=",", dec=".")
M <- cor(NC_Associations_Data)
M[1:10, 1:10]

##              P2          P5          P9         P12         P13
## P2   1.00000000 -0.05976123  0.99999944 -0.05976123  0.21245299
## P5  -0.05976123  1.00000000 -0.05976131 -0.02857143  0.56024640
## P9   0.99999944 -0.05976131  1.00000000 -0.05976131  0.21248635
## P12 -0.05976123 -0.02857143 -0.05976131  1.00000000 -0.05096471
## P13  0.21245299  0.56024640  0.21248635 -0.05096471  1.00000000
## P14 -0.05976123  1.00000000 -0.05976131 -0.02857143  0.56024640
## P15 -0.08574886  0.69821536 -0.08574898 -0.04099594  0.36613665
## P16 -0.08574886  0.69821536 -0.08574898 -0.04099594  0.36613665
## P17 -0.05976123 -0.02857143 -0.05976131 -0.02857143 -0.05096471
## P18 -0.05976123 -0.02857143 -0.05976131 -0.02857143 -0.05096471
##             P14         P15         P16         P17         P18
## P2  -0.05976123 -0.08574886 -0.08574886 -0.05976123 -0.05976123
## P5   1.00000000  0.69821536  0.69821536 -0.02857143 -0.02857143
## P9  -0.05976131 -0.08574898 -0.08574898 -0.05976131 -0.05976131
## P12 -0.02857143 -0.04099594 -0.04099594 -0.02857143 -0.02857143
## P13  0.56024640  0.36613665  0.36613665 -0.05096471 -0.05096471
## P14  1.00000000  0.69821536  0.69821536 -0.02857143 -0.02857143
## P15  0.69821536  1.00000000  1.00000000 -0.04099594 -0.04099594
## P16  0.69821536  1.00000000  1.00000000 -0.04099594 -0.04099594
## P17 -0.02857143 -0.04099594 -0.04099594  1.00000000 -0.02857143
## P18 -0.02857143 -0.04099594 -0.04099594 -0.02857143  1.00000000


We  will  illustrate  alternative  correlation  plots  using  the  corrplot  function  in 
Figs.  4.26,  4.27,  4.28,  4.29,  4.30,  and  4.31. 

corrplot(M, method="circle", title="circle", tl.cex=0.5, tl.col='black', mar=c(1, 1, 1, 1))
# mar specifies the margins as c(bottom, left, top, right), in lines of text

corrplot(M, method="square", title="square", tl.cex=0.5, tl.col='black', mar=c(1, 1, 1, 1))

corrplot(M, method="ellipse", title="ellipse", tl.cex=0.5, tl.col='black', mar=c(1, 1, 1, 1))

corrplot(M, method="pie", title="pie", tl.cex=0.5, tl.col='black', mar=c(1, 1, 1, 1))

corrplot(M, type="upper", tl.pos="td", method="circle", tl.cex=0.5, tl.col='black',
    order="hclust", diag=FALSE, mar=c(1, 1, 0, 1))

corrplot.mixed(M, number.cex=0.6, tl.cex=0.6)
Fig. 4.26 Correlation plot of regional brain volumes of the healthy normal controls using circles

Fig. 4.27 The same correlation plot of regional NC brain volumes using squares

Fig. 4.28 The same correlation plot of regional NC brain volumes using ellipses

Fig. 4.29 The same correlation plot of regional NC brain volumes using pie segments

Fig. 4.30 Upper diagonal correlation plot of regional NC brain volumes using circles

Fig. 4.31 Mixed correlation plot of regional NC brain volumes using circles and numbers


In the figures above, the color shade of each cell encodes the sign and strength of the correlation between the two variables indexed by the cell's row (y) and column (x).
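To make the color coding concrete, here is a minimal base-R sketch that does not rely on the corrplot package. It uses the built-in mtcars data as a stand-in for the matrix M (an assumption, since M is built earlier in the chapter), and the names M_demo, pal, and color_matrix are hypothetical.

```r
# Stand-in for the correlation matrix M (mtcars is a built-in R dataset)
M_demo <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Map each correlation in [-1, 1] to one of 10 color classes (blue = negative, red = positive)
pal <- colorRampPalette(c("blue", "white", "red"))(10)
color_class <- cut(as.vector(M_demo), breaks = seq(-1, 1, length.out = 11),
                   include.lowest = TRUE, labels = FALSE)
color_matrix <- matrix(pal[color_class], nrow = nrow(M_demo), dimnames = dimnames(M_demo))

# Base-graphics heat map; rows are flipped so that the first variable appears at the top
n <- nrow(M_demo)
image(1:n, 1:n, t(M_demo)[, n:1], zlim = c(-1, 1), col = pal,
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:n, labels = colnames(M_demo))
axis(2, at = 1:n, labels = rev(rownames(M_demo)))
```

The object color_matrix holds the color assigned to each pair of variables, which is essentially what the shaded glyphs in a correlation plot convey.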


4.5  Relationships 

4.5.1  Line  Plots  Using  ggplot 

Line charts display a series of data points (e.g., observed intensities (Y) over time (X)) by connecting them with straight-line segments. These can be used either to track temporal changes of a process or to compare the trajectories of multiple cases, time series, or subjects over time, space, or state.

In this section, we will use the Earthquakes dataset from the SOCR website. It records information about earthquakes that occurred between 1969 and 2007 with magnitudes larger than 5 on the Richter scale.


# library("xml2"); library("rvest")

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021708_Earthquakes")

html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...

earthquake <- html_table(html_nodes(wiki_url, "table")[[2]])

In this dataset, we group the data by Magt (magnitude type). We will draw a "Depth vs. Latitude" line plot from this dataset. The function we use is ggplot() from the ggplot2 package. Its input is a data frame, and aes() specifies the aesthetic mappings, i.e., how variables in the data are mapped to visual properties (aesthetics) of the geom objects, e.g., lines (Fig. 4.32).

library(ggplot2)

plot4 <- ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt)) +
  geom_line()

print(plot4)

There are two important components in the script. The first part, ggplot(earthquake, aes(Depth, Latitude, group=Magt, color=Magt)), specifies the setting of the plot: dataset, group, and color. The second part, geom_line(), specifies that we are going to draw lines between data points. In later chapters, we will frequently use the ggplot2 package, whose generic structure always involves concatenating function calls like function1 + function2 + ....
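For comparison, the grouping logic behind such a line plot can be sketched in base graphics, without ggplot2. This is only an illustration under stated assumptions: quake_demo is a small synthetic data frame (not the SOCR earthquake data), and its magnitude-type labels are hypothetical.

```r
set.seed(1)

# Synthetic stand-in for the earthquake data (values are made up for illustration)
quake_demo <- data.frame(
  Depth    = runif(40, 0, 100),
  Latitude = runif(40, 32.5, 45),
  Magt     = rep(c("Md", "ML"), each = 20)
)

# Split by magnitude type and order each group by the x variable,
# so that line segments connect neighboring depths
by_type <- split(quake_demo, quake_demo$Magt)
by_type <- lapply(by_type, function(d) d[order(d$Depth), ])

# Empty canvas first, then one line per group
plot(quake_demo$Depth, quake_demo$Latitude, type = "n",
     xlab = "Depth", ylab = "Latitude")
cols <- c("red", "blue")
for (i in seq_along(by_type)) {
  lines(by_type[[i]]$Depth, by_type[[i]]$Latitude, col = cols[i])
}
legend("topright", legend = names(by_type), col = cols, lty = 1)
```

The split-then-draw loop plays the role of group=Magt and color=Magt in the aes() call, and the lines() call corresponds to the geom_line() layer.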


Fig. 4.32 Line plot of Earthquake magnitude type by its ground depth and latitude
4.5.2  Density  Plots

We can visualize the distribution of different variables using density plots.

The following segment of R code plots the distribution of latitude for the different earthquake magnitude types. It again uses the ggplot() function, now combined with geom_density() (Fig. 4.33).


# library("ggplot2")

plot5 <- ggplot(earthquake, aes(Latitude, group=Magt)) +
  geom_density(aes(color=Magt), size = 2) +
  theme(legend.position = 'right',
        legend.text = element_text(color= 'black', size = 12, face = 'bold'),
        legend.key = element_rect(size = 0.5, linetype='solid'),
        legend.key.size = unit(1.5, 'lines'))
print(plot5)

# table(earthquake$Magt)  # to see the distribution of magnitude types

Note  the  green  magt  type  (Local  (ML)  earthquakes)  peaks  at  latitude  37.5,  which 
represents  37-38°  North,  near  San  Francisco,  California. 
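Locating the peak of each group's density curve can also be done numerically. The following base-R sketch uses the built-in iris data as a stand-in for the earthquake data (an assumption made only so the example is self-contained); dens_by_group and peaks are hypothetical names.

```r
# One kernel density estimate per group (Species plays the role of Magt)
dens_by_group <- lapply(split(iris$Sepal.Length, iris$Species), density)

# Location of the mode (highest point) of each estimated density
peaks <- sapply(dens_by_group, function(d) d$x[which.max(d$y)])
peaks

# Overlay the three density curves
y_max <- max(sapply(dens_by_group, function(d) max(d$y)))
plot(dens_by_group[[1]], xlim = range(iris$Sepal.Length) + c(-1, 1),
     ylim = c(0, y_max), main = "Density by group", xlab = "Sepal.Length")
for (i in 2:3) lines(dens_by_group[[i]], col = i)
legend("topright", legend = names(dens_by_group), col = 1:3, lty = 1)
```

Applied to the earthquake data, the same which.max() step would return the latitude at which each magnitude type is most concentrated.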


4.5.3  Distributions 

Probability distribution plots depict the characteristics of the underlying process and can be used to contrast and compare the shapes of distributions as proxies of the corresponding natural phenomena. For univariate, bivariate, and multivariate processes, the distribution plots are drawn as curves, surfaces, or manifolds, respectively. These plots may be used to inspect areas under the distribution plot that correspond to either probabilities or data values. The Distributome Cauchy distribution calculator and the SOCR 2D bivariate Normal Distribution plot provide simple examples of distribution plots in 1D and 2D, respectively (Fig. 4.34).


Fig.  4.33  Density  plot  of  Earthquakes  according  to  their  magnitude  types  and  latitude  location 


Fig. 4.34 Univariate and bivariate probability distribution calculators (Distributome Project)


4.5.4  2D  Kernel  Density  and  3D  Surface  Plots 

Density  estimation  is  the  process  of  using  observed  data  to  compute  an  estimate  of 
the  underlying  process’  probability  density  function.  There  are  several  approaches  to 
obtain  density  estimation,  but  the  most  basic  technique  is  to  use  a  rescaled  histogram. 
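As a small illustration of the rescaled-histogram idea, the sketch below uses the built-in faithful dataset (chosen here only for convenience). Rescaling the bar heights so that the total bar area equals one turns the histogram into a crude density estimate, which can then be compared with a smooth kernel estimate.

```r
x <- faithful$eruptions

# Histogram without plotting: h$density = count / (n * bin width)
h <- hist(x, breaks = 20, plot = FALSE)

# The total area of the rescaled bars equals 1, as required of a density
area <- sum(h$density * diff(h$breaks))
area

# Rescaled histogram with a kernel density estimate overlaid
hist(x, breaks = 20, freq = FALSE, main = "Rescaled histogram vs. kernel density",
     xlab = "Eruption duration (min)")
lines(density(x), col = "blue", lwd = 2)
```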

Plotting  2D  Kernel  Density  and  3D  Surface  plots  is  very  important  and  useful  in 
multivariate  exploratory  data  analytics. 

We will use the plot_ly() function from the plotly package, which can take a data frame as input.

To create a surface plot, we use two vectors, x and y, with lengths m and n, respectively. We also need an m × n matrix z that stores the surface height at each grid point (x_i, y_j). In the simplest case, z can be generated from x and y, e.g., by their outer product x %*% t(y).
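A minimal base-R sketch of this grid-plus-matrix layout follows. The grid sizes and the Gaussian bump used for the heights are arbitrary choices made for illustration, and persp() is used in place of plotly to keep the example dependency-free.

```r
# Grid vectors of lengths m = 30 and n = 20
x <- seq(-2, 2, length.out = 30)
y <- seq(-1, 1, length.out = 20)

# Outer product: z_prod[i, j] = x[i] * y[j], an m x n matrix
z_prod <- x %*% t(y)
dim(z_prod)

# More generally, outer() evaluates any function on the (x, y) grid
z <- outer(x, y, function(a, b) exp(-(a^2 + b^2)))

# Static 3D surface of z over the grid
persp(x, y, z, theta = 30, phi = 30, xlab = "x", ylab = "y", zlab = "z")
```

Note that the rows of z follow x and the columns follow y, which is the orientation that persp() and contour() expect.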

To plot the 2D kernel density estimate, we will use the eruptions data from the "Old Faithful" geyser in Yellowstone National Park, Wyoming, available in R as the geyser dataset of the MASS package. The kde2d() function, also from MASS, performs the 2D kernel density estimation.
kd <- with(MASS::geyser, MASS::kde2d(duration, waiting, n = 50))
kd$x[1:5]
## [1] 0.8333333 0.9275510 1.0217687 1.1159864 1.2102041
kd$y[1:5]
## [1] 43.00000 44.32653 45.65306 46.97959 48.30612
kd$z[1:5, 1:5]
##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 9.068691e-13 4.238943e-12 1.839285e-11 7.415672e-11 2.781459e-10
## [2,] 1.814923e-12 8.473636e-12 3.671290e-11 1.477410e-10 5.528260e-10
## [3,] 3.428664e-12 1.599235e-11 6.920273e-11 2.780463e-10 1.038314e-09
## [4,] 6.114498e-12 2.849475e-11 1.231748e-10 4.942437e-10 1.842547e-09
## [5,] 1.029643e-11 4.793481e-11 2.070127e-10 8.297218e-10 3.088867e-09

Here z is the 50 × 50 matrix of kernel density estimates evaluated on the grid defined by x and y, and we apply plot_ly to the list kd via the with() function (Fig. 4.35).


library(plotly)

with(kd, plot_ly(x=x, y=y, z=z, type="surface"))

Note that we used the option "surface".
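As a quick sanity check on the kde2d() output, the following sketch (assuming the MASS package is installed; kd_demo and mass are hypothetical names) confirms the grid dimensions and approximates the probability mass covered by the grid with a Riemann sum. A static contour plot is used here instead of the interactive surface.

```r
# 2D kernel density estimate on a 50 x 50 grid (same call pattern as above)
kd_demo <- MASS::kde2d(MASS::geyser$duration, MASS::geyser$waiting, n = 50)

# z has one row per x grid value and one column per y grid value
dim(kd_demo$z)

# Riemann-sum approximation of the probability mass over the grid;
# it is somewhat below 1 because the grid is truncated at the data range
dx <- diff(kd_demo$x[1:2])
dy <- diff(kd_demo$y[1:2])
mass <- sum(kd_demo$z) * dx * dy
mass

# Static alternative to the interactive surface plot
contour(kd_demo$x, kd_demo$y, kd_demo$z, xlab = "duration", ylab = "waiting")
```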

For 3D surfaces, R has a built-in dataset called volcano. It records the volcano height on a regular grid of locations x, y (longitude, latitude). Because the rows and columns of z implicitly define the x and y grid, we can simply specify z to get the complete surface plot (Fig. 4.36).

Fig. 4.35 Interactive surface plot of kernel density for the Old Faithful geyser eruptions

Fig. 4.36 Interactive surface plot of kernel density for the R volcano dataset


volcano[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## ...
##  [3,]  102  102  103  103  103  103  103  102  102   102
## ...
##  [9,]  107  108  108  109  109  109  109  108  108   107
## ...

plot_ly(z=volcano, type="surface")
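If plotly is not available, a static version of the same surface can be sketched with base graphics. The 10-unit grid spacing below follows the description in R's help page for volcano; the viewing angles and colors are arbitrary choices.

```r
# volcano is a plain numeric matrix of heights
dim(volcano)

# Explicit grid coordinates implied by the row and column indices
x <- 10 * seq_len(nrow(volcano))
y <- 10 * seq_len(ncol(volcano))

# Static 3D perspective plot of the height surface
persp(x, y, volcano, theta = 135, phi = 30, col = "lightgreen",
      shade = 0.5, border = NA, xlab = "x", ylab = "y", zlab = "Height")
```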


4.5.5  Multiple  2D  Image  Surface  Plots 


Let’s  look  at  another  example  using  a  2D  brain  image  (Fig.  4.37). 
Fig. 4.37 Interactive surface plot of kernel density for the 2D brain imaging data


# install.packages("jpeg")  ## if necessary
library(jpeg)

# Get an image file downloaded (default: MRI_ImageHematoma.jpg)
img_url <- "https://umich.instructure.com/files/1627149/download?download_frd=1"
img_file <- tempfile(); download.file(img_url, img_file, mode="wb")
img <- readJPEG(img_file)
file.info(img_file)
file.remove(img_file)  # cleanup
## [1] TRUE

img <- img[, , 1]  # extract the first channel (from RGB intensity spectrum) as a univariate 2D array

# install.packages("spatstat")
# package spatstat has a function blur() that applies a Gaussian blur
library(spatstat)

img_s <- as.matrix(blur(as.im(img), sigma=10))  # the smoothed version of the image

z2 <- img_s + 1  # abs(rnorm(1, 1, 1))  # Upper confidence surface
z3 <- img_s - 1  # abs(rnorm(1, 1, 1))  # Lower confidence limit

# Plot the image surfaces
p <- plot_ly(z=img, type="surface", showscale=FALSE) %>%
  add_trace(z=z2, type="surface", showscale=FALSE, opacity=0.98) %>%
  add_trace(z=z3, type="surface", showscale=FALSE, opacity=0.98)
p  # Plot the mean-surface along with lower and upper confidence surfaces.
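The smoothing-plus-offset construction above can be imitated without the jpeg, spatstat, or plotly packages. The sketch below is an assumption-laden stand-in: it uses a random synthetic image (img_demo) and a separable Gaussian moving average built with stats::filter() rather than spatstat's blur(), and the ±1 offsets simply mimic the illustrative upper and lower surfaces.

```r
set.seed(2)

# Synthetic 40 x 30 "image" with intensities in [0, 1]
img_demo <- matrix(runif(40 * 30), nrow = 40, ncol = 30)

# Normalized 1D Gaussian kernel (7 taps)
k <- dnorm(-3:3, sd = 1.5)
k <- k / sum(k)

# Circular 1D smoothing, so that no NA values appear at the borders
smooth1d <- function(v) as.numeric(stats::filter(v, k, sides = 2, circular = TRUE))

# Separable 2D smoothing: first along each row, then along each column
img_sm <- t(apply(img_demo, 1, smooth1d))
img_sm <- apply(img_sm, 2, smooth1d)

# Illustrative upper and lower surfaces around the smoothed image
upper <- img_sm + 1
lower <- img_sm - 1

# Static view of the smoothed surface
persp(img_sm, theta = 30, phi = 30, zlim = c(-1, 2), zlab = "intensity")
```

Because the kernel sums to one and the filter wraps around, the smoothed image keeps the same mean intensity while its variability shrinks.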
http://socr.umich.edu/HTML5/BrainViewer/

Fig. 4.38 Live demo: interactive brain viewer


The  DSPA  Online  appendix  provides  additional  details  on  shape  representation, 
modeling,  and  computing  on  surfaces  and  manifolds. 


4.5.6  3D  and  4D  Visualizations 

Many  datasets  have  intrinsic  multi-dimensional  characteristics.  For  instance,  the 
human  body  is  a  3D  solid  of  matter  (three  spatial  dimensions  can  be  used  to  describe 
the  position  of  every  component,  e.g.,  sMRI  volume)  that  changes  over  time  (the 
fourth  dimension,  e.g.,  fMRI  hypervolumes). 

The SOCR BrainViewer shows how to use a web browser to visualize 2D cross-sections of 3D volumes, display volume renderings, and show 1D (e.g., 1-manifold curves embedded in 3D) and 2D (e.g., surfaces, shapes) models jointly in the same 3D scene (Fig. 4.38).

We  will  now  illustrate  an  example  of  3D/4D  visualization  in  R  using  the  packages 
brainR  and  rgl. 

# install.packages("brainR")  ## if necessary
require(brainR)

# Test data: http://socr.umich.edu/HTML5/BrainViewer/data/TestBrain.nii.gz

brainURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/TestBrain.nii.gz"
brainFile <- file.path(tempdir(), "TestBrain.nii.gz")
download.file(brainURL, dest=brainFile, quiet=TRUE)
brainVolume <- readNIfTI(brainFile, reorient=FALSE)

brainVolDims <- dim(brainVolume); brainVolDims
## [1] 181 217 181

# try different levels at which to construct contour surfaces (10 fast)
# lower values yield smoother surfaces # see ?contour3d
contour3d(brainVolume, level = 20, alpha = 0.1, draw = TRUE)

# multiple levels may be used to show multiple shells
# "activations" or surfaces like hyper-intense white matter
# This will take 1-2 minutes to render!
contour3d(brainVolume, level = c(10, 120), alpha = c(0.3, 0.5),
          add = TRUE, color=c("yellow", "red"))

# create text for orientation of right/left
text3d(x=brainVolDims[1]/2, y=brainVolDims[2]/2, z = brainVolDims[3]*0.98, text="Top")
text3d(x=brainVolDims[1]*0.98, y=brainVolDims[2]/2, z = brainVolDims[3]/2, text="Right")

### render this on a webpage and view it!
# browseURL(paste("file://",
#   writeWebGL_split(dir = file.path(tempdir(), "webGL"),
#   template = system.file("my_template.html", package="brainR"),
#   width=500), sep=""))


For  4D  fMRI  time-series,  we  can  load  the  hypervolumes  similarly  and  then 
display  them  (Figs.  4.39,  4.40,  4.41,  4.42,  4.43,  and  4.44): 


Fig. 4.39 2D cross-sectional (axial) views of the 4D fMRI data

Fig. 4.40 Histogram plot of the fMRI intensities

Fig. 4.41 Coronal (top-left), sagittal (top-right), and axial (bottom-left) views of the 4D fMRI data

Fig. 4.42 Truncated histogram of the fMRI hyper-volume intensities

Fig. 4.43 Intensities of the fifth timepoint epoch of the 4D fMRI time series

Fig. 4.44 The complete time course of the raw (blue) and two smoothed versions of the fMRI timeseries at one specific voxel location (30, 30, 15)


# See examples here: https://cran.r-project.org/web/packages/oro.nifti/vignettes/nifti.pdf
# and here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0089470

fMRIURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/fMRI_FilteredData_4D.nii.gz"
fMRIFile <- file.path(tempdir(), "fMRI_FilteredData_4D.nii.gz")
download.file(fMRIURL, dest=fMRIFile, quiet=TRUE)

(fMRIVolume <- readNIfTI(fMRIFile, reorient=FALSE))

## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 4 (INT16)
##   Bits per Pixel  : 16
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 1 (Scanner_Anat)
##   Sform Code      : 0 (Unknown)
##   Dimension       : 64 x 64 x 21 x 180
##   Pixel Dimension : 4 x 4 x 6 x 3
##   Voxel Units     : mm
##   Time Units      : sec

# dimensions: 64 x 64 x 21 x 180 ; 4mm x 4mm x 6mm x 3 sec

fMRIVolDims <- dim(fMRIVolume); fMRIVolDims
## [1]  64  64  21 180

time_dim <- fMRIVolDims[4]; time_dim
## [1] 180

# Plot the 4D array of imaging data in a 5x5 grid of images
# The first three dimensions are spatial locations of the voxel (volume element)
# and the fourth dimension is time for this functional MRI (fMRI) acquisition.
image(fMRIVolume, zlim=range(fMRIVolume)*0.95)

hist(fMRIVolume)

# Plot an orthographic display of the fMRI data using the axial plane
# containing the left-and-right thalamus to approximately center
# the crosshair vertically
orthographic(fMRIVolume, xyz=c(34, 29, 10), zlim=range(fMRIVolume)*0.9)

stat_fmri_test <- ifelse(fMRIVolume > 15000, fMRIVolume, NA)
hist(stat_fmri_test)

dim(stat_fmri_test)
## [1]  64  64  21 180

overlay(fMRIVolume, fMRIVolume[, , , 5], zlim.x=range(fMRIVolume)*0.95)
# overlay(fMRIVolume, stat_fmri_test[, , , 5], zlim.x=range(fMRIVolume)*0.95)

# To examine the time course of a specific 3D voxel (say the one at x=30, y=30, z=15):
plot(fMRIVolume[30, 30, 15, ], type='l', main="Time Series of 3D Voxel \n (x=30, y=30, z=15)", col="blue")

x1 <- c(1:180)
y1 <- loess(fMRIVolume[30, 30, 15, ] ~ x1, family = "gaussian")
lines(x1, smooth(fMRIVolume[30, 30, 15, ]), col = "red", lwd = 2)
lines(ksmooth(x1, fMRIVolume[30, 30, 15, ], kernel = "normal", bandwidth = 5), col = "green", lwd = 3)
Chapter  19  provides  more  details  about  longitudinal  and  time-series  data 
analysis. 
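The indexing pattern used above (three spatial indices plus one time index) does not depend on the NIfTI data and can be practiced on a small synthetic array. In the sketch below, vol4d is a hypothetical random array standing in for an fMRI hypervolume, and the smoothing calls mirror the ones applied to the real voxel time course.

```r
set.seed(3)

# Synthetic 4D "hypervolume": 8 x 8 x 4 voxels observed at 60 time points
vol4d <- array(rnorm(8 * 8 * 4 * 60, mean = 100, sd = 5), dim = c(8, 8, 4, 60))

# Fix the three spatial indices to get one voxel's time course (a vector of length 60)
ts_voxel <- vol4d[3, 3, 2, ]

# Fix the time index to get the full 3D volume at the 5th time point
vol_t5 <- vol4d[, , , 5]

# Two smoothers of the voxel time course: loess and a Gaussian kernel smoother
t_idx <- seq_along(ts_voxel)
fit_loess <- loess(ts_voxel ~ t_idx)
ks <- ksmooth(t_idx, ts_voxel, kernel = "normal", bandwidth = 5, x.points = t_idx)

plot(t_idx, ts_voxel, type = "l", col = "blue", xlab = "time", ylab = "intensity")
lines(t_idx, fitted(fit_loess), col = "red", lwd = 2)
lines(ks, col = "green", lwd = 3)
```

Passing x.points explicitly makes ksmooth() return one smoothed value per observed time point, which simplifies comparing it with the raw series.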


4.6  Appendix 

4.6.1  Hands-on  Activity  (Health  Behavior  Risks) 


# load data CaseStudy09_HealthBehaviorRisks_Data

data_2 <- read.csv("https://umich.instructure.com/files/602090/download?download_frd=1", sep=",", header = TRUE)

Classify the cases using these variables: "AGE_G" "SEX" "RACEGR3" "IMPEDUC" "IMPMRTL" "EMPLOY1" "INCOMG" "CVDINFR4" "CVDCRHD4" "CVDSTRK3" "DIABETE3" "RFSMOK3" "FRTLT1" "VEGLT1"

data.raw <- data_2[, -c(1, 14, 17)]

# Does the classification match either of these:
# TOTINDA (Leisure time physical activities per month, 1=Yes, 2=No, 9=Don't know/Refused/Missing)
# RFDRHV4 (Heavy alcohol consumption, 1=No, 2=Yes, 9=Don't know/Refused/Missing)

hc = hclust(dist(data.raw), 'ave')

# the agglomeration method can be specified "ward.D", "ward.D2", "single",
# "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC)
# or "centroid" (= UPGMC)

Plot a clustering diagram (Fig. 4.45)

par(mfrow=c(1, 1))

# very simple dendrogram
plot(hc)
Fig. 4.45 Clustering dendrogram using the Health Behavior Risks case-study


summary(data_2$TOTINDA); summary(data_2$RFDRHV4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    1.00    1.56    2.00    9.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     1.0     1.3     1.0     9.0

cutree(hc, k = 2)

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## ...
## [885] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [919] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [953] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [987] 2 2 2 2 2 2 2 2 2 2 2 2 2 2

# alternatively specify the height, which is, the value of the criterion associated with the
# clustering method for the particular agglomeration -- cutree(hc, h= 10)

table(cutree(hc, h= 10))  # cluster distribution
##
##   1   2
## 930  70
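The relation between cutting a dendrogram by the number of clusters (k) and by height (h) can be checked on a small built-in dataset. The sketch below uses USArrests as a stand-in for the health behavior data (an assumption made for self-containment); hc_demo, labels_k3, and h_cut are hypothetical names.

```r
# Average-linkage hierarchical clustering on standardized USArrests
hc_demo <- hclust(dist(scale(USArrests)), method = "average")

# Cut by number of clusters
labels_k3 <- cutree(hc_demo, k = 3)
table(labels_k3)

# Cut by height: any height between the 2nd and 3rd largest merge heights
# separates the tree into exactly three clusters
top_heights <- sort(hc_demo$height, decreasing = TRUE)
h_cut <- mean(top_heights[2:3])
labels_h <- cutree(hc_demo, h = h_cut)

# Cluster sizes for k = 2, ..., 5
sizes <- lapply(2:5, function(k) table(cutree(hc_demo, k = k)))
names(sizes) <- paste("Number of Clusters=", 2:5, sep = "")
sizes
```

Both calls return one integer label per observation, so the two cuts can be compared element by element.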

Let’s  try  to  identify  the  number  of  cases  for  varying  number  of  clusters. 


# To identify the number of cases for varying number of clusters we
# can combine calls to cutree and table in a call to sapply -
# to see the sizes of the clusters for 2 <= k <= 5 cluster-solutions:
# numbClusters=4;

myClusters = sapply(2:5, function(numbClusters) table(cutree(hc, numbClusters)))
names(myClusters) <- paste("Number of Clusters=", 2:5, sep = "")
myClusters

## $`Number of Clusters=2`
##
##   1   2
## 930  70
##
## $`Number of Clusters=3`
##
##   1   2   3
## 930  50  20
##
## $`Number of Clusters=4`
##
##   1   2   3   4
## 500 430  50  20
##
## $`Number of Clusters=5`
##
##   1   2   3   4   5
## 500 430  10  40  20


Next,  inspect  the  cluster  labels  for  all  SubjectIDs: 

# To see which SubjectIDs are in which clusters:
table(cutree(hc, k=2))

##
##   1   2
## 930  70

groups.k.2 <- cutree(hc, k = 2)

sapply(unique(groups.k.2), function(g) data_2$ID[groups.k.2 == g])

We  can  examine  which  TOTINDA  (Leisure  time  physical  activities  per  month, 
1  =  Yes,  2  =  No,  9  =  Don’t  know/Refused/Missing)  and  which  RFDRHV4  are  in 
which  clusters  (Fig.  4.46): 


Fig. 4.46 Scatterplot between two variables in the Health Behavior Risks case-study


groups.k.3 <- cutree(hc, k = 3)

sapply(unique(groups.k.3), function(g) data_2$TOTINDA[groups.k.3 == g])

## [[1]]
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## ...
## [911] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## [[2]]
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 9 9 9 9 9
## [36] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
##
## [[3]]
##  [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

sapply(unique(groups.k.3), function(g) data_2$RFDRHV4[groups.k.3 == g])

## [[1]]
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## ...
## [911] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## [[2]]
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 9 9 9 9 9 9 9 9 9 9
##
## [[3]]
##  [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

#  Perhaps  there  are  intrinsically  3  groups  here  e.g.,  1,  2  and  9  . 

# Note that there is quite a dependence between the outcome variables.
plot(data_2$RFDRHV4, data_2$TOTINDA)

# drill down deeper
table(groups.k.3, data_2$RFDRHV4)

##
## groups.k.3   1   2   9
##          1 910  20   0
##          2   0  40  10
##          3   0   0  20

To  characterize  the  clusters,  we  can  look  at  cluster  summary  statistics,  like  the 
median,  of  the  variables  that  were  used  to  perform  the  cluster  analysis.  These  can  be 
broken  down  by  the  groups  identified  by  the  cluster  analysis.  The  aggregate  function 
will  compute  statistics  (e.g.,  median)  on  many  variables  simultaneously.  Let’s 
examine  the  median  values  for  each  variable  we  used  in  the  cluster  analysis,  broken 
up  by  cluster  groups: 


aggregate(data_2, list(groups.k.3), median)

##   Group.1    ID AGE_G SEX RACEGR3 IMPEDUC IMPMRTL EMPLOY1 INCOMG CVDINFR4
## 1       1 465.5     5   2       1       5       1       2      4        2
## 2       2 955.5     6   2       4       6       5       8      6        2
## 3       3 990.5     6   2       9       6       6       8      6        2
##   CVDCRHD4 CVDSTRK3 DIABETE3 RFSMOK3 RFDRHV4 FRTLT1 VEGLT1 TOTINDA
## 1      2.0        2        3       1       1      1      1       1
## 2      2.0        2        3       2       2      9      9       2
## 3      4.5        2        4       9       9      9      9       9
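The same per-cluster summary can be reproduced on a built-in dataset. The sketch below uses USArrests in place of the health behavior data (an assumption for self-containment) and names the grouping column explicitly, so the result is labeled cluster rather than Group.1; grp and med are hypothetical names.

```r
# Cluster the standardized data and cut the tree into three groups
hc_demo <- hclust(dist(scale(USArrests)), method = "average")
grp <- cutree(hc_demo, k = 3)

# Median of every variable within each cluster
med <- aggregate(USArrests, by = list(cluster = grp), FUN = median)
med

# The same value computed directly for one variable and one cluster
median(USArrests$Murder[grp == 1])
```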

4.6.2  Additional  ggplot  Examples


Below,  we  will  show  additional  visualization  examples. 


Housing  Price  Data 


This  example  uses  the  SOCR  Home  Price  Index  data  of  19  major  US  cities  from 
1991  to  2009  (Fig.  4.47). 


Fig. 4.47 Home price index plot over time
library(rvest)

# draw data
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_091609_SnP_HomePriceIndex")

hm_price_index <- html_table(html_nodes(wiki_url, "table")[[1]])
head(hm_price_index)

##   Index Year    Month AZ-Phoenix CA-LosAngeles CA-SanDiego CA-SanFrancisco
## 1     1 1991  January      65.26         95.28       83.13           71.17
## 2     2 1991 February      65.29         94.12       81.87           70.27
## 3     3 1991    March      64.60         92.83       80.89           69.56
## 4     4 1991    April      64.35         92.83       80.73           69.46
## 5     5 1991      May      64.37         93.37       81.41           70.13
## 6     6 1991     June      64.88         94.25       82.20           70.83
##   CO-Denver DC-Washington FL-Miami FL-Tampa GA-Atlanta IL-Chicago
## 1     48.67         89.38    79.08    81.75      69.61      70.04
## 2     48.68         88.80    78.55    81.76      69.17      70.50
## 3     48.85         87.59    78.44    81.43      69.05      70.63
## 4     49.20         87.56    78.55    81.46      69.40      71.09
## 5     49.51         88.61    77.95    81.33      69.69      71.36
## 6     50.09         89.28    78.49    81.77      70.14      71.66
##   MA-Boston MI-Detroit MN-Minneapolis NC-Charlotte NV-LasVegas NY-NewYork
## 1     64.97      58.24          64.21        73.32       80.96      74.59
## 2     64.17      57.76          64.20        73.26       81.58      73.69
## 3     63.57      57.63          64.19        72.75       81.65      72.87
## 4     63.35      57.85          64.30        72.88       81.67      72.29
## 5     63.84      58.36          64.75        73.26       82.02      72.63
## 6     64.25      58.90          64.95        73.49       81.91      73.50
##   OH-Cleveland OR-Portland WA-Seattle Composite-10
## 1        68.24       56.53      65.53        78.53
## 2        67.96       56.94      64.60        77.77
## 3        68.18       58.03      64.47        77.00
## 4        69.10       58.39      65.09        76.86
## 5        69.92       58.90      66.03        77.31
## 6        70.55       59.54      66.68        78.02

hm_price_index <- hm_price_index[, c(-2, -3)]
colnames(hm_price_index)[1] <- c('time')
require(reshape)

hm_index_melted = melt(hm_price_index, id.vars='time')  # a common trick for plot, wide -> long format

ggplot(data=hm_index_melted, aes(x=time, y=value, color=variable)) +
  geom_line(size=1.5) + ggtitle("HomePriceIndex:1991-2009")
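The wide-to-long conversion performed by melt() can be mimicked in base R, which may help clarify what the resulting value and variable columns contain. The sketch below is an illustration only: wide is a tiny made-up table with hypothetical columns CityA and CityB, and stack() is used in place of reshape's melt().

```r
# Tiny wide-format table: one row per time point, one column per city
wide <- data.frame(time  = 1:3,
                   CityA = c(65.3, 65.2, 64.6),
                   CityB = c(95.3, 94.1, 92.8))

# Long format: one row per (time, city) pair
stacked <- stack(wide[, c("CityA", "CityB")])   # columns: values, ind
long <- data.frame(time = rep(wide$time, times = 2), stacked)
names(long) <- c("time", "value", "variable")
long

# One line per city, analogous to color=variable in the ggplot call
plot(long$time, long$value, type = "n", xlab = "time", ylab = "value")
cities <- levels(long$variable)
for (i in seq_along(cities)) {
  d <- long[long$variable == cities[i], ]
  lines(d$time, d$value, col = i)
}
legend("right", legend = cities, col = seq_along(cities), lty = 1)
```

In the long table, each original column name becomes a level of variable, which is what allows a single color mapping to separate the cities.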

Modeling  the  Home  Price  Index  Data  (Fig.  4.48) 


# Linear regression and predict
# Column names containing "-" are non-syntactic and must be wrapped in backticks
hm_price_index$pred = predict(lm(`CA-SanFrancisco` ~ `CA-LosAngeles`,
                                 data=hm_price_index))

ggplot(data=hm_price_index, aes(x = `CA-LosAngeles`)) +
  geom_point(aes(y = `CA-SanFrancisco`)) +
  geom_line(aes(y = pred), color='Magenta', size=2) +
  ggtitle("PredictHomeIndex SF - LA")
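Fitting and predicting with non-syntactic column names can be tried on simulated data. In the sketch below, idx is a hypothetical data frame whose values are generated from a known line plus noise (they are not the SOCR index values), so the recovered slope can be compared with the true one.

```r
set.seed(4)

# Simulated index values; check.names = FALSE preserves the hyphenated names
idx <- data.frame(`CA-LosAngeles` = seq(60, 250, length.out = 50),
                  check.names = FALSE)
idx$`CA-SanFrancisco` <- 10 + 0.8 * idx$`CA-LosAngeles` + rnorm(50, sd = 3)

# Backticks let the formula refer to the hyphenated column names
fit <- lm(`CA-SanFrancisco` ~ `CA-LosAngeles`, data = idx)
coef(fit)

# Fitted values for the observed data, and predictions for new LA index values
idx$pred <- predict(fit)
new_la <- data.frame(`CA-LosAngeles` = c(100, 200), check.names = FALSE)
new_pred <- predict(fit, newdata = new_la)
new_pred

plot(idx$`CA-LosAngeles`, idx$`CA-SanFrancisco`,
     xlab = "CA-LosAngeles", ylab = "CA-SanFrancisco")
lines(idx$`CA-LosAngeles`, idx$pred, col = "magenta", lwd = 2)
```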
Fig. 4.48 Predicting the San Francisco home prices using data from the Los Angeles home sales


We  can  also  use  ggplot  to  draw  pairs  plots  (Fig.  4.49). 


# install.packages("GGally")
require(GGally)

pairs <- hm_price_index[, 10:15]
head(pairs)

##   GA-Atlanta IL-Chicago MA-Boston MI-Detroit MN-Minneapolis NC-Charlotte
## 1      69.61      70.04     64.97      58.24          64.21        73.32
## 2      69.17      70.50     64.17      57.76          64.20        73.26
## 3      69.05      70.63     63.57      57.63          64.19        72.75
## 4      69.40      71.09     63.35      57.85          64.30        72.88
## 5      69.69      71.36     63.84      58.36          64.75        73.26
## 6      70.14      71.66     64.25      58.90          64.95        73.49

colnames(pairs) <- c("Atlanta", "Chicago", "Boston", "Detroit", "Minneapolis", "Charlotte")

ggpairs(pairs)  # you can define the plot design by claim "upper", "lower", "diag" etc.
Fig. 4.49 A more elaborate pairs plot of the home price index dataset illustrating the distributions of home prices within a metropolitan area, as well as the paired relations between regions


Map  of  the  Neighborhoods  of  Los  Angeles  (LA) 

This example interrogates data on 110 LA neighborhoods, which include measures of education, income, and population demographics.

Here, we select the Longitude and Latitude as the axes, mark these 110 neighborhoods according to their population, fill the points according to the income of each area, and label each neighborhood (Fig. 4.50).
Fig. 4.50 Bubble plot of Los Angeles neighborhood location (longitude vs latitude), population size, and income


library(rvest)
require(ggplot2)

# draw data
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_LA_Neighborhoods_Data")

html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top

LA_Nbhd_data <- html_table(html_nodes(wiki_url, "table")[[2]])
# display several lines of data
head(LA_Nbhd_data);


##                  LA_Nbhd Income Schools Diversity Age Homes Vets Asian
## 1        Adams_Normandie  29606     691       0.6  26  0.26 0.05  0.05
## 2                 Arleta  65649     719       0.4  29  0.29 0.07  0.11
## 3      Arlington_Heights  31423     687       0.8  31  0.31 0.05  0.13
## 4        Atwater_Village  53872     762       0.9  34  0.34 0.06  0.20
## 5 Baldwin_Hills/Crenshaw  37948     656       0.4  36  0.36 0.10  0.05
## 6                Bel-Air 208861     924       0.2  46  0.46 0.13  0.08
##   Black Latino White Population Area Longitude Latitude
## 1  0.25   0.62  0.06      31068  0.8 -118.3003 34.03097
## 2  0.02   0.72  0.13      31068  3.1 -118.4300 34.24060
## 3  0.25   0.57  0.05      22106  1.0 -118.3201 34.04361
## 4  0.01   0.51  0.22      14888  1.8 -118.2658 34.12491
## 5  0.71   0.17  0.03      30123  3.0 -118.3667 34.01909
## 6  0.01   0.05  0.83       7928  6.6 -118.4636 34.09615

theme_set(theme_grey())

# treat ggplot as a variable
# When we specify "data", we can access its columns directly, e.g., "x = Longitude"
plot1 = ggplot(data=LA_Nbhd_data, aes(x=Longitude, y=Latitude))

# you can easily add attributes, points, labels (e.g., text)
plot1 + geom_point(aes(size=Population, fill=Income), pch=21,
                   stroke=0.2, alpha=0.7, color=2) +
  geom_text(aes(label=LA_Nbhd), size=1.5, hjust=0.5, vjust=2,
            check_overlap = TRUE) +
  scale_size_area() +
  scale_fill_distiller(limits=c(range(LA_Nbhd_data$Income)),
                       palette='RdBu', na.value='white', name='Income') +
  scale_y_continuous(limits=c(min(LA_Nbhd_data$Latitude), max(LA_Nbhd_data$Latitude))) +
  coord_fixed(ratio=1) +
  ggtitle('LA Neighborhoods Scatter Plot (Location, Population, Income)')


Observe  that  some  areas  (e.g.,  Beverly  Hills)  have  disproportionately  higher 
incomes.  In  addition,  it  is  worth  pointing  out  that  the  resulting  plot  resembles  this 
plot  of  LA  County  (Fig.  4.51). 
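The encoding used in this bubble plot (area proportional to population, fill color by income) can be sketched in base graphics as well. The four rows below are taken from the head() output shown earlier, with coordinates rounded; the palette and the symbol size are arbitrary choices, and nbhd_demo, radius, and fill_col are hypothetical names.

```r
# First four neighborhoods from the table above (coordinates rounded)
nbhd_demo <- data.frame(
  Name       = c("Adams_Normandie", "Arleta", "Arlington_Heights", "Atwater_Village"),
  Longitude  = c(-118.30, -118.43, -118.32, -118.27),
  Latitude   = c(34.03, 34.24, 34.04, 34.12),
  Population = c(31068, 31068, 22106, 14888),
  Income     = c(29606, 65649, 31423, 53872)
)

# Radius chosen so that the circle AREA (pi * r^2) is proportional to population
radius <- sqrt(nbhd_demo$Population / pi)

# Bin income into 5 classes and map each class to a color
pal <- colorRampPalette(c("red", "white", "blue"))(5)
fill_col <- pal[cut(nbhd_demo$Income, breaks = 5, labels = FALSE)]

symbols(nbhd_demo$Longitude, nbhd_demo$Latitude, circles = radius,
        inches = 0.3, bg = fill_col, fg = "grey30",
        xlab = "Longitude", ylab = "Latitude")
text(nbhd_demo$Longitude, nbhd_demo$Latitude, labels = nbhd_demo$Name,
     cex = 0.6, pos = 1)
```

Scaling the radius by the square root of the population mirrors what scale_size_area() does in the ggplot2 version, so that a neighborhood with twice the population covers twice the area.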


Latin  Letter  Frequency  in  Different  Languages 

This  example  uses  ggplot  to  interrogate  the  SOCR  Latin  letter  frequency  data, 
which  includes  the  frequencies  of  the  26  common  Latin  characters  in  several 
derivative  languages.  There  is  quite  a  variation  between  the  frequencies  of  Latin 
letters  in  different  languages  (Figs.  4.52  and  4.53). 
Fig. 4.51 The Los Angeles county map resembles the plot in Fig. 4.50

Fig. 4.52 Frequency distributions of Latin letters in several languages (stacked bar chart; x-axis: Letter, y-axis: value, fill: language)

Fig. 4.53 Pie chart similar to the stacked bar chart, Fig. 4.52


library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_LetterFrequencyData")

Letter <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(Letter)


##     Letter             English            French            German
##  Length:27          Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  Class :character   1st Qu.:0.01000   1st Qu.:0.01000   1st Qu.:0.01000
##  Mode  :character   Median :0.02000   Median :0.03000   Median :0.03000
##                     Mean   :0.03667   Mean   :0.03704   Mean   :0.03741
##                     3rd Qu.:0.06000   3rd Qu.:0.06500   3rd Qu.:0.05500
##                     Max.   :0.13000   Max.   :0.15000   Max.   :0.17000
##     Spanish          Portuguese         Esperanto          Italian
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.00500   1st Qu.:0.01000   1st Qu.:0.00500
##  Median :0.03000   Median :0.03000   Median :0.03000   Median :0.03000
##  Mean   :0.03815   Mean   :0.03778   Mean   :0.03704   Mean   :0.03815
##  3rd Qu.:0.06000   3rd Qu.:0.05000   3rd Qu.:0.06000   3rd Qu.:0.06000
##  Max.   :0.14000   Max.   :0.15000   Max.   :0.12000   Max.   :0.12000
##     Turkish           Swedish            Polish           Toki_Pona
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.01000   1st Qu.:0.01500   1st Qu.:0.00000
##  Median :0.03000   Median :0.03000   Median :0.03000   Median :0.03000
##  Mean   :0.03667   Mean   :0.03704   Mean   :0.03704   Mean   :0.03704
##  3rd Qu.:0.05500   3rd Qu.:0.05500   3rd Qu.:0.04500   3rd Qu.:0.05000
##  Max.   :0.12000   Max.   :0.10000   Max.   :0.20000   Max.   :0.17000
##      Dutch            Avgerage
##  Min.   :0.00000   Min.   :0.00000
##  1st Qu.:0.01000   1st Qu.:0.01000
##  Median :0.02000   Median :0.03000
##  Mean   :0.03704   Mean   :0.03741
##  3rd Qu.:0.06000   3rd Qu.:0.06000
##  Max.   :0.19000   Max.   :0.12000

head(Letter)

##   Letter English French German Spanish Portuguese Esperanto Italian
## 1      a    0.08   0.08   0.07    0.13       0.15      0.12    0.12
## 2      b    0.01   0.01   0.02    0.01       0.01      0.01    0.01
## 3      c    0.03   0.03   0.03    0.05       0.04      0.01    0.05
## 4      d    0.04   0.04   0.05    0.06       0.05      0.03    0.04
## 5      e    0.13   0.15   0.17    0.14       0.13      0.09    0.12
## 6      f    0.02   0.01   0.02    0.01       0.01      0.01    0.01
##   Turkish Swedish Polish Toki_Pona Dutch Avgerage
## 1    0.12    0.09   0.08      0.17  0.07     0.11
## 2    0.03    0.01   0.01      0.00  0.02     0.01
## 3    0.01    0.01   0.04      0.00  0.01     0.03
## 4    0.05    0.05   0.03      0.00  0.06     0.04
## 5    0.09    0.10   0.07      0.07  0.19     0.12
## 6    0.00    0.02   0.00      0.00  0.01     0.01

sum(Letter[, -1])  # reasonable

## [1] 13.08

require(reshape)
library(scales)
library(ggplot2)

# reshape to long format; column 14 (Avgerage) is excluded
dtm = melt(Letter[, -14], id.vars = c('Letter'))
p = ggplot(dtm, aes(x = Letter, y = value, fill = variable)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_y_continuous(labels = percent_format()) + ggtitle('Pie Chart')

# or exchange the roles of language and letter:
# p = ggplot(dtm, aes(x = variable, y = value, fill = Letter)) +
#   geom_bar(position = "fill", stat = "identity") +
#   scale_y_continuous(labels = percent_format())

p

# a ggplot pie chart is actually a stacked bar plot in polar coordinates
p + coord_polar()


You  can  experiment  with  the  SOCR  interactive  motion  chart,  see  Fig.  4.54. 
http://socr.umich.edu/HTML5/MotionChart/

Fig. 4.54 Live demo: 6D SOCR MotionChart


4.7  Assignments  4:  Data  Visualization 
4.7.1  Common  Plots 


Use the Divorce data (Case Study 01) or the TBI dataset (CaseStudy11_TBI) to
generate appropriate visualization of histograms, density plots, pie charts, heatmaps,
barplots, and paired correlation plots.


4.7.2  Trees  and  Graphs

Use  the  SOCR  Resource  Hierarchical  data  (JSON)  or  the  DSPA  Dynamic  Certificate 
Map  (JSON)  to  generate  some  tree/graph  displays  of  the  structural  information. 

The  code  fragment  below  shows  an  example  of  processing  a  JSON  hierarchy. 

library(jsonlite)
library(RCurl)
library(data.tree)

url <- "http://socr.umich.edu/html/navigators/D3/xml/SOCR_HyperTree.json"
raw_data <- getURL(url)
document <- fromJSON(raw_data)

# build a three-level tree from the nested "children" entries
tree <- Node$new(document$name)
for(i in seq_len(length(document))) {
  tree$AddChild(document$children$name[[i]])
  for(j in seq_len(length(document$children$children[[i]]))) {
    tree$children[[i]]$AddChild(document$children$children[[i]]$name[[j]])
    for(k in seq_len(length(document$children$children[[i]]$children[[j]]))) {
      tree$children[[i]]$children[[j]]$AddChild(
        document$children$children[[i]]$children[[j]]$name[[k]])
    }
  }
}

suppressMessages(library(igraph))
plot(as.igraph(tree, directed = T, direction = "climb"))

suppressMessages(library(networkD3))
treenetwork <- ToDataFrameNetwork(tree, "name")
simpleNetwork(treenetwork, fontSize = 10)


4.7.3  Exploratory Data Analytics (EDA)

• Use SOCR Oil Gas Data to generate plots: (i) read the data table, you may need to fill
the inconsistent values with NAs; (ii) data preprocessing: select variables, convert
types, etc.; (iii) generate two figures: the first figure includes two subplots,
consumption plots and production plots; the second figure includes three subplots,
for fossil, nuclear and renewable, respectively. To draw the subplots, you
can use facet_grid(); (iv) all figures should have year as the x axis; (v) the first
figure should include three curves (fossil, nuclear and renewable) for each
subplot; the second figure should include two curves (consumption and production)
for each subplot.

• Use SOCR Ozone Data to generate a correlation plot with the variables MTH_1,
MTH_2, ..., MTH_12. (Hint: you need to obtain the correlation matrix first, then
apply the corrplot package. Try some alternative methods as well, e.g., circle,
pie, mixed, etc.)

• Use SOCR CA Ozone Data to generate a 3D surface plot (using the variables
Longitude, Latitude and O3).

• Generate a sequence of random numbers from a Student's t distribution. Draw the
sample histogram and compare it with the normal distribution. Try different degrees
of freedom. What do you find? Does varying the seed and regenerating the
Student's t sample change that conclusion?

• Use the SOCR Parkinson's Big Meta data (only rows with time=0) to generate
a heatmap plot. Set RowSideColors, ColSideColors and rainbow. (Hint: you may
need to select columns, properly convert the data, and normalize it.)

• Use SOCR 2011 US Jobs Ranking data to draw a scatter plot of Overall_Score
vs. Average_Income (USD); include a title and label the axes. Then try
qplot for Overall_Score vs. Average_Income (USD): (1) fill with the
Stress_Level; (2) size the points according to Hiring_Potential; and
(3) label using Job_Title.

• Use SOCR Turkiye Student Evaluation Data to generate trees and graphs, using
cutree() and selecting any k you prefer. (Use variables Q1-Q28.)
Linear  Algebra  &  Matrix  Computing


Linear algebra is a branch of mathematics that studies linear associations using
vectors, vector-spaces, linear equations, linear transformations, and matrices. It is
generally challenging to visualize complex data, e.g., large vectors, tensors, and
tables in n-dimensional Euclidean spaces (n > 3). Linear algebra allows us to
mathematically represent, computationally model, statistically analyze, synthetically
simulate, and visually summarize such complex data.

Virtually all natural processes permit first-order linear approximations. This is
useful because linear equations are easy to write, interpret, and solve. These
first-order approximations may be useful to practically assess the process, determine
general trends, identify potential patterns, and suggest associations in the data.

Linear equations represent the simplest type of models for many processes.
Higher order models may include additional non-linear terms, e.g., Taylor-series
expansions. Linear algebra provides the foundation for linear representation, analytics,
computational solutions, inference, and visualization of first-order affine
models. Linear algebra is a small part of the larger mathematics field of functional
analysis, which is actually the infinite-dimensional version of linear algebra.

Specifically, linear algebra allows us to computationally manipulate, model,
solve, and interpret complex systems of equations representing large numbers of
dimensions and variables. Arbitrarily large problems can be mathematically
transformed into simple matrix equations of the form $Ax = b$ or $Ax = \lambda x$.

In  this  chapter,  we  review  the  fundamentals  of  linear  algebra,  matrix  manipulation 
and  their  applications  to  represent,  model,  and  analyse  real  data.  Specifically,  we  will 
cover  (1)  construction  of  matrices  and  matrix  operations,  (2)  general  matrix  algebra 
notations,  (3)  eigenvalues  and  eigenvectors  of  linear  operators,  (4)  least  squares 
estimation,  and  (5)  linear  regression  and  variance-covariance  matrices. 


5.1  Matrices  (Second  Order  Tensors) 

5.1.1  Create  Matrices 


The easiest way to create a matrix is by using the matrix() function, which
arranges the elements of a vector into a matrix with the specified numbers of rows and columns.


seq1<-seq(1:6)
m1<-matrix(seq1, nrow=2, ncol=3)
m1

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m2<-diag(seq1)
m2

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    2    0    0    0    0
## [3,]    0    0    3    0    0    0
## [4,]    0    0    0    4    0    0
## [5,]    0    0    0    0    5    0
## [6,]    0    0    0    0    0    6

m3<-matrix(rnorm(20), nrow=5)
m3

##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.4877535  0.22081284 -0.6067573 -0.8982306
## [2,] -0.1672924 -1.49020015  0.3038424 -0.1875045
## [3,] -0.4771204 -0.39004837  1.1160825 -0.6948070
## [4,] -0.9274687  0.08378863  0.3846627  0.2386284
## [5,]  0.8672767 -0.86752831  1.5536853  0.3222158


The function diag() is very useful. When the object is a vector, it creates a
diagonal matrix with the vector in the principal diagonal.


diag(c(1, 2, 3))

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3

When the object is a matrix, diag() returns its principal diagonal.


diag(m1)

## [1] 1 4

When the object is a scalar, diag(k) returns a k x k identity matrix.


diag(4)

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1


5.1.2  Adding  Columns  and  Rows 


The functions cbind() and rbind(), which append columns and rows, respectively, are
used throughout the textbook.


c1<-1:5
m4<-cbind(m3, c1)
m4

##                                                   c1
## [1,]  0.4877535  0.22081284 -0.6067573 -0.8982306  1
## [2,] -0.1672924 -1.49020015  0.3038424 -0.1875045  2
## [3,] -0.4771204 -0.39004837  1.1160825 -0.6948070  3
## [4,] -0.9274687  0.08378863  0.3846627  0.2386284  4
## [5,]  0.8672767 -0.86752831  1.5536853  0.3222158  5

r1<-1:4
m5<-rbind(m3, r1)
m5

##          [,1]        [,2]       [,3]       [,4]
##     0.4877535  0.22081284 -0.6067573 -0.8982306
##    -0.1672924 -1.49020015  0.3038424 -0.1875045
##    -0.4771204 -0.39004837  1.1160825 -0.6948070
##    -0.9274687  0.08378863  0.3846627  0.2386284
##     0.8672767 -0.86752831  1.5536853  0.3222158
## r1  1.0000000  2.00000000  3.0000000  4.0000000


Note that m5 has a row name r1 in the sixth (last) row. We can remove row/column
names by setting them to NULL.


dimnames(m5)<-list(NULL, NULL)
m5

##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.4877535  0.22081284 -0.6067573 -0.8982306
## [2,] -0.1672924 -1.49020015  0.3038424 -0.1875045
## [3,] -0.4771204 -0.39004837  1.1160825 -0.6948070
## [4,] -0.9274687  0.08378863  0.3846627  0.2386284
## [5,]  0.8672767 -0.86752831  1.5536853  0.3222158
## [6,]  1.0000000  2.00000000  3.0000000  4.0000000
5.2  Matrix  Subscripts

Each element in a matrix has a location. A[i, j] refers to the element in the ith row
and jth column of a matrix A. We can also access specific rows or columns using matrix
subscripts.


m6<-matrix(1:12, nrow=3)
m6

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

m6[1, 2]

## [1] 4

m6[1, ]

## [1]  1  4  7 10

m6[, 2]

## [1] 4 5 6

m6[, c(2, 3)]

##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
## [3,]    6    9
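
Two further subscripting idioms often accompany the ones above. The following sketch is an illustrative addition rather than part of the original example; it rebuilds the same m6 matrix and shows negative indices (which exclude rows or columns), the drop = FALSE argument (which keeps a single selected row as a 1 x 4 matrix instead of simplifying it to a vector), and logical subscripts. The object names other than m6 are made up for this illustration.

```r
# Rebuild the 3 x 4 example matrix used above
m6 <- matrix(1:12, nrow = 3)

# Negative indices exclude rows/columns: here, drop the first column
m6_no_col1 <- m6[, -1]
dim(m6_no_col1)            # 3 rows, 3 columns

# By default a single selected row is simplified to a plain vector ...
row1_vec <- m6[1, ]
is.matrix(row1_vec)

# ... while drop = FALSE preserves the matrix structure (1 x 4)
row1_mat <- m6[1, , drop = FALSE]
dim(row1_mat)

# Logical subscripts select the elements that satisfy a condition
big_values <- m6[m6 > 10]
big_values
```

Keeping the matrix structure with drop = FALSE can matter when the result is passed on to matrix operations such as %*%, which depend on the dimensions of their arguments.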

5.3  Matrix  Operations 
5.3.1  Addition 

Elements in the same position are added, and the sum is placed at the same location
in the result.


m7<-matrix(1:6, nrow=2)
m7

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m8<-matrix(2:7, nrow=2)
m8

##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7

m7+m8

##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    5    9   13
5.3.2  Subtraction

Similar to addition, matrix subtraction reflects differences between elements in the
same position. Subtracting a scalar subtracts it from every element.

m8-m7

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1

m8-1

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6


5.3.3  Multiplication 

Multiplicative operations are different from additive operations. We can do
elementwise multiplication or matrix multiplication. For matrix multiplication, the
matrix dimensions have to match. That is, the number of columns in the first matrix
must equal the number of rows in the second matrix.


Elementwise  Multiplication

Multiplication between elements in the same position.


m8*m7

##      [,1] [,2] [,3]
## [1,]    2   12   30
## [2,]    6   20   42


Matrix  Multiplication

The resulting matrix will have the same number of rows as the first matrix and the
same number of columns as the second matrix.
dim(m8)

## [1] 2 3

m9<-matrix(3:8, nrow=3)
m9

##      [,1] [,2]
## [1,]    3    6
## [2,]    4    7
## [3,]    5    8

dim(m9)

## [1] 3 2

m8%*%m9

##      [,1] [,2]
## [1,]   52   88
## [2,]   64  109


We obtained a 2 x 2 matrix by multiplying a 2 x 3 matrix by a 3 x 2 matrix.
Multiplying a column vector by a row vector is called the outer product. Assume we have
two vectors u and v; in matrix multiplication their outer product is represented
mathematically as $uv^T$. In R, the operator for the outer product is %o%.


u<-c(1, 2, 3, 4, 5)
v<-c(4, 5, 6, 7, 8)
u%o%v

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    5    6    7    8
## [2,]    8   10   12   14   16
## [3,]   12   15   18   21   24
## [4,]   16   20   24   28   32
## [5,]   20   25   30   35   40

u%*%t(v)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    5    6    7    8
## [2,]    8   10   12   14   16
## [3,]   12   15   18   21   24
## [4,]   16   20   24   28   32
## [5,]   20   25   30   35   40

What are the differences between u %*% t(v), t(u) %*% v, u * t(v), and
u * v?
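
One way to explore this question is to evaluate each expression and inspect the shape of the result. The sketch below is an illustrative addition; the object names outer_uv, inner_uv, elem_uv, and elem_row are made up for this example.

```r
u <- c(1, 2, 3, 4, 5)
v <- c(4, 5, 6, 7, 8)

outer_uv <- u %*% t(v)   # 5 x 5 outer product, same values as u %o% v
inner_uv <- t(u) %*% v   # 1 x 1 inner (dot) product
elem_uv  <- u * v        # elementwise product, a plain vector of length 5
elem_row <- u * t(v)     # elementwise product stored as a 1 x 5 matrix

dim(outer_uv)
dim(inner_uv)
length(elem_uv)
dim(elem_row)

# The inner product equals the sum of the elementwise products
all.equal(as.numeric(inner_uv), sum(elem_uv))
```

The operator %*% follows the row-by-column rule of matrix multiplication, whereas * always works position by position, which is why only the %*% results change shape when the transpose moves from one vector to the other.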
5.3.4  Element-wise  Division

Elementwise division is defined similarly to elementwise multiplication; however,
it is different from multiplicative inversion.

m8/m7

##      [,1]     [,2]     [,3]
## [1,]  2.0 1.333333 1.200000
## [2,]  1.5 1.250000 1.166667

m8/2

##      [,1] [,2] [,3]
## [1,]  1.0  2.0  3.0
## [2,]  1.5  2.5  3.5


5.3.5  Transpose 

The transpose of a matrix is a new matrix created by swapping the columns and the
rows of the original matrix. In R, the transpose is computed by the function t().

m8

##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7

t(m8)

##      [,1] [,2]
## [1,]    2    3
## [2,]    4    5
## [3,]    6    7

Notice that the [1, 2] element in m8 is the [2, 1] element in the transpose matrix
t(m8).


5.3.6  Multiplicative  Inverse 

The inverse of a matrix ($A^{-1}$) is its multiplicative inverse. That is, multiplying the
original matrix ($A$) by its inverse ($A^{-1}$) yields an identity matrix that has 1's on the
diagonal and 0's off the diagonal:

$$AA^{-1} = I.$$

Assume we have the following 2 x 2 matrix:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$

Its matrix inverse is

$$\frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

For higher dimensions, the formula for computing the inverse matrix is more
complex. In R, we can use the solve() function to calculate the matrix inverse, if it
exists.

m10<-matrix(1:4, nrow=2)
m10

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

solve(m10)

##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5

m10%*%solve(m10)

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Note  that  only  some  matrices  have  inverses.  These  are  square  matrices,  i.e.,  they 
have  the  same  number  of  rows  and  columns,  and  are  non-singular. 
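
The 2 x 2 inverse formula and the non-singularity requirement can both be checked numerically. The sketch below is an illustrative addition: it applies the closed-form formula to the matrix m10 from above and compares the result with solve(); the matrix named singular and the other helper names are made up for this example. The second matrix has a zero determinant, so solve() signals an error, which is caught here with tryCatch().

```r
m10 <- matrix(1:4, nrow = 2)

# Entries of the 2 x 2 matrix (a b; c d)
a <- m10[1, 1]; b <- m10[1, 2]
c_ <- m10[2, 1]; d <- m10[2, 2]

# Closed-form inverse: 1/(ad - bc) * (d -b; -c a), filled column by column
det_m10 <- a * d - b * c_
inv_formula <- (1 / det_m10) * matrix(c(d, -c_, -b, a), nrow = 2)
all.equal(inv_formula, solve(m10))

# A singular matrix: the second column is twice the first, so ad - bc = 0
singular <- matrix(c(1, 2, 2, 4), nrow = 2)
det(singular)

# solve() raises an error for singular input; return NULL instead of stopping
inv_attempt <- tryCatch(solve(singular), error = function(e) NULL)
is.null(inv_attempt)
```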

Another  function  that  can  help  us  compute  the  inverse  of  a  matrix  is  the  ginv  ( ) 
function  under  the  MASS  package,  which  reports  the  Moore-Penrose  Generalized 
Inverse  of  a  matrix. 

require(MASS)

## Loading required package: MASS

ginv(m10)

##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5
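
For an invertible matrix such as m10, ginv() agrees with solve(). The generalized inverse is also defined for non-square matrices, where solve() cannot be used. The sketch below is an illustrative addition, assuming the MASS package is installed; it uses the 2 x 3 matrix m8 from earlier in the chapter and checks the defining Moore-Penrose property that multiplying A by its generalized inverse and by A again returns A. The name m8_pinv is made up for this example.

```r
library(MASS)

m8 <- matrix(2:7, nrow = 2)     # 2 x 3 matrix, so solve(m8) is not defined
m8_pinv <- ginv(m8)             # 3 x 2 generalized inverse
dim(m8_pinv)

# Defining property of the Moore-Penrose inverse: A A+ A = A
all.equal(m8 %*% m8_pinv %*% m8, m8)

# m8 has linearly independent rows, so A A+ is the 2 x 2 identity
all.equal(m8 %*% m8_pinv, diag(2))
```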

Also, the same function solve() can be used to solve matrix equations:
solve(A, b) returns the vector x in the equation b = Ax (i.e., $x = A^{-1}b$).

s1<-diag(c(2, 4, 6, 8))
s2<-c(1, 2, 3, 4)
solve(s1, s2)

## [1] 0.5 0.5 0.5 0.5


Table 5.1 summarizes some basic matrix operation functions.

Table 5.1  Basic matrix operators in R

Expression         Explanation
t(x)               Transpose
diag(x)            Diagonal
%*%                Matrix multiplication
solve(a, b)        Solves a %*% x = b for x
solve(a)           Matrix inverse of a
rowsum(x, group)   Sums of rows of a matrix-like object, computed within each level of group
rowSums(x)         Fast version of row sums
colSums(x)         Id. for columns
rowMeans(x)        Fast version of row means
colMeans(x)        Id. for columns

mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)

mat1.identity <- mat1.inv %*% mat1
mat1.identity

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

b <- c(1, 2)
x <- solve(mat1, b)
x

## [1] 1.785714 2.357143


5.4  Matrix  Algebra  Notation 

Let's introduce the basic matrix notation. The product $AB$ between matrices $A$ and
$B$ is defined only if the number of columns in $A$ equals the number of rows in $B$. That
is, we can multiply an $m \times n$ matrix $A$ by an $n \times k$ matrix $B$ and the result $AB$ will be
an $m \times k$ matrix. Each element of the product matrix, $(AB)_{i,j}$, represents the (dot) product of
the ith row in $A$ and the jth column in $B$, which are of the same size $n$. Matrix
multiplication is row-by-column.
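
The row-by-column rule can be spelled out with explicit loops and compared against the built-in operator. The sketch below is an illustrative addition with small made-up matrices A (2 x 3) and B (3 x 2); AB_manual is a hypothetical name used only here.

```r
A <- matrix(1:6, nrow = 2)    # m x n = 2 x 3
B <- matrix(3:8, nrow = 3)    # n x k = 3 x 2

# The product is defined only when the inner dimensions agree
stopifnot(ncol(A) == nrow(B))

# (AB)[i, j] is the dot product of row i of A and column j of B
AB_manual <- matrix(0, nrow = nrow(A), ncol = ncol(B))
for (i in seq_len(nrow(A))) {
  for (j in seq_len(ncol(B))) {
    AB_manual[i, j] <- sum(A[i, ] * B[, j])
  }
}

AB_manual
all.equal(AB_manual, A %*% B)   # agrees with the matrix product operator
```

In practice %*% should be preferred, since it delegates to optimized linear algebra routines; the loop version is only meant to make the definition concrete.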


5.4.1  Linear  Models 

Linear  algebra  notation  simplifies  the  mathematical  descriptions  and  manipulations 
of  linear  models,  as  well  as  coding  in  R. 

The  main  point  is  to  show  how  we  can  write  linear  models  using  matrix  notation. 
Later,  we’ll  explain  how  this  is  useful  for  solving  the  least  squares  matrix  equation. 
Let’s  start  by  defining  the  notation  and  matrix  multiplication. 
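5.4.2  Solving  Systems  of  Equations

Linear algebra notation enables the mathematical analysis and the analytical solution
of systems of linear equations:

$$\begin{aligned} a + b + 2c &= 6 \\ 3a - 2b + c &= 2 \\ 2a + b - c &= 3. \end{aligned}$$

It provides a generic machinery for solving these problems:

$$\underbrace{\begin{pmatrix} 1 & 1 & 2 \\ 3 & -2 & 1 \\ 2 & 1 & -1 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} a \\ b \\ c \end{pmatrix}}_{x} = \underbrace{\begin{pmatrix} 6 \\ 2 \\ 3 \end{pmatrix}}_{b}.$$

That is: $Ax = b$, which yields a solution vector $x$:

$$\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 3 & -2 & 1 \\ 2 & 1 & -1 \end{pmatrix}^{-1} \begin{pmatrix} 6 \\ 2 \\ 3 \end{pmatrix}.$$

In other words, $A^{-1}Ax = x = A^{-1}b$.

Notice that this parallels the solution of simple (univariate) linear equations like:

$$\underbrace{2}_{\text{(design matrix) } A} \ \underbrace{x}_{\text{unknown } x} \ \underbrace{-3}_{\text{simple constant term}} = \underbrace{5}_{b}.$$

The constant term, $-3$, can be simply joined with the right-hand side, $b$, to form a
new term $b' = 5 + 3 = 8$. Thus, the shifting factor is mostly ignored in linear models,
or linear equations, to simplify the equation to:

$$\underbrace{2}_{\text{(design matrix) } A} \ \underbrace{x}_{\text{unknown } x} = \underbrace{5 + 3}_{b'} = \underbrace{8}_{b'}.$$

This (simple) linear equation is solved by multiplying both sides by the inverse
(reciprocal) of the multiplier of $x$, 2:

$$\frac{1}{2} 2x = \frac{1}{2} 8.$$

Thus, the unique solution is:

$$x = \frac{1}{2} 8 = 4.$$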
So, let's use exactly the same protocol to solve the corresponding matrix equation
(linear equations, $Ax = b$) using R (the unknown is $x$, and the design matrix $A$ and the
constant vector $b$ are known):


A_matrix_values <- c(1, 1, 2, 3, -2, 1, 2, 1, -1)
A <- t(matrix(A_matrix_values, nrow=3, ncol=3))
b <- c(6, 2, 3)

# to solve Ax = b, x = A^{-1} * b
x <- solve(A, b)

# Ax = b ==> x = A^{-1} * b
x

## [1] 1.35 1.75 1.45

# Check the solution x = (1.35 1.75 1.45)
LHS <- A %*% x
round(LHS-b)

##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0


What if we want to triple-check that the solve method provides
accurate solutions to matrix-based systems of linear equations?

We can generate the solution ($x$) to the equation $Ax = b$ using first principles:

$$x = A^{-1}b.$$


A.inverse <- solve(A)  # the inverse matrix A^{-1}
x1 <- A.inverse %*% b
# check if x and x1 are the same
x; x1

## [1] 1.35 1.75 1.45

##      [,1]
## [1,] 1.35
## [2,] 1.75
## [3,] 1.45

round(x-x1, 6)

##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0
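
Because these results are floating-point numbers, comparisons with an explicit tolerance are a useful alternative to rounding. The sketch below is an illustrative addition that rebuilds the same system and uses all.equal(), which tests for near equality; the names x_solve and x_inv are made up for this example.

```r
# Same 3 x 3 system as above
A <- t(matrix(c(1, 1, 2, 3, -2, 1, 2, 1, -1), nrow = 3, ncol = 3))
b <- c(6, 2, 3)

x_solve <- solve(A, b)               # solves Ax = b directly
x_inv   <- as.vector(solve(A) %*% b) # forms the inverse first, then multiplies

# all.equal() compares numbers up to a small numerical tolerance
all.equal(x_solve, x_inv)
all.equal(as.vector(A %*% x_solve), b)

# Largest absolute residual of the linear system
max(abs(A %*% x_solve - b))
```

Both routes agree for this small example. Calling solve(A, b) avoids forming the inverse explicitly, which is the approach commonly recommended for larger systems.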
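5.4.3  The  Identity  Matrix


The identity matrix is the matrix analog to the multiplicative numeric identity, i.e.,
the number 1. Multiplying the identity matrix by any other matrix ($B$) does not
change the matrix $B$. For this to happen, the multiplicative identity matrix must look
like:

$$I = \begin{pmatrix} 1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \dots & 1 & 0 \\ 0 & 0 & 0 & \dots & 0 & 1 \end{pmatrix}.$$

The identity matrix is always a square matrix with diagonal elements 1 and 0 at
the off-diagonal elements.

If you follow the matrix multiplication rule above, you notice this works out:

$$XI = \begin{pmatrix} x_{1,1} & \dots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \dots & x_{n,p} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \dots & 1 & 0 \\ 0 & 0 & 0 & \dots & 0 & 1 \end{pmatrix} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \dots & x_{n,p} \end{pmatrix} = X.$$

In R, you can form an identity matrix as follows:


n <- 3  # pick dimensions
I <- diag(n); I

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

# A is the 3 x 3 matrix defined in Sect. 5.4.2
A %*% I; I %*% A

##      [,1] [,2] [,3]
## [1,]    1    1    2
## [2,]    3   -2    1
## [3,]    2    1   -1

##      [,1] [,2] [,3]
## [1,]    1    1    2
## [2,]    3   -2    1
## [3,]    2    1   -1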
5.5  Scalars,  Vectors  and  Matrices


Let's look at this notation deeper. In the baseball player data, there are three
quantitative variables: Height, Weight, and Age. Suppose the variable
Weight is represented as a response random vector $Y_1, \dots, Y_n$.

We can examine players' Weight as a function of Age and Height.


# Data: https://umich.instructure.com/courses/38100/files/folder/data
# (01a_data.txt)

data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',
    as.is=T, header=T)
attach(data)
head(data)

##              Name Team       Position Height Weight   Age
## 1   Adam_Donachie  BAL        Catcher     74    180 22.99
## 2       Paul_Bako  BAL        Catcher     74    215 34.69
## 3 Ramon_Hernandez  BAL        Catcher     72    210 30.78
## 4    Kevin_Millar  BAL  First_Baseman     72    210 35.43
## 5     Chris_Gomez  BAL  First_Baseman     73    188 35.71
## 6   Brian_Roberts  BAL Second_Baseman     69    176 29.39
We can also use vector notation. We usually use bold to distinguish vectors from
the individual elements:

$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

The default representation of data vectors is as columns, i.e., we have dimension
$n \times 1$, as opposed to $1 \times n$ rows.

Similarly, we can use math notation to represent the covariates or predictors: Age
and Height. In a case with two predictors, we can represent them like this:

$$\mathbf{X}_1 = \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{n,1} \end{pmatrix} \quad \text{and} \quad \mathbf{X}_2 = \begin{pmatrix} x_{1,2} \\ \vdots \\ x_{n,2} \end{pmatrix}.$$

Note that for the baseball players example, $x_{1,1} = Age_1$ and $x_{i,1} = Age_i$, with $Age_i$
representing the Age of the ith player, and similarly, $x_{i,2} = Height_i$ represents the
height of the ith player. These vectors are also thought of as $n \times 1$ matrices.

It is convenient to represent both covariates as a matrix:

$$\mathbf{X} = [\mathbf{X}_1 \ \mathbf{X}_2] = \begin{pmatrix} x_{1,1} & x_{1,2} \\ \vdots & \vdots \\ x_{n,1} & x_{n,2} \end{pmatrix}.$$
This matrix is of dimension $n \times 2$ and can be created in R this way:

X <- cbind(Age, Height)
head(X)

##        Age Height
## [1,] 22.99     74
## [2,] 34.69     74
## [3,] 30.78     72
## [4,] 35.43     72
## [5,] 35.71     73
## [6,] 29.39     69

dim(X)

## [1] 1034    2

We can also use this notation to denote an arbitrary number of covariates ($k$) with
the following $n \times k$ matrix:

$$\mathbf{X} = \begin{pmatrix} x_{1,1} & \dots & x_{1,k} \\ x_{2,1} & \dots & x_{2,k} \\ \vdots & & \vdots \\ x_{n,1} & \dots & x_{n,k} \end{pmatrix}.$$

You can simulate such a matrix in R now using matrix(), instead of cbind():

n <- 1034; k <- 5
X <- matrix(1:(n*k), n, k)
head(X)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1 1035 2069 3103 4137
## [2,]    2 1036 2070 3104 4138
## [3,]    3 1037 2071 3105 4139
## [4,]    4 1038 2072 3106 4140
## [5,]    5 1039 2073 3107 4141
## [6,]    6 1040 2074 3108 4142

dim(X)

## [1] 1034    5


By default, the matrices are filled in a column-by-column order; however, using
the byrow=TRUE argument allows us to change that order to row-by-row:

n <- 1034; k <- 5
X <- matrix(1:(n*k), n, k, byrow=TRUE)
head(X)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25
## [6,]   26   27   28   29   30

dim(X)

## [1] 1034    5

A scalar is just a single (univariate) number, as opposed to a vector or a matrix,
and is usually denoted by a lower case letter.


5.5.1  Sample  Statistics  (Mean,  Variance) 

Mean 


To compute the sample average and variance of a dataset, we use the formulas:

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

and

$$var(Y) = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

which can be represented with matrix multiplication.
Define an $n \times 1$ matrix made of 1's:

$$A = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}.$$

This implies that:

$$\frac{1}{n} A^T Y = \frac{1}{n} \begin{pmatrix} 1 & 1 & \dots & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \bar{Y}.$$

Note that we multiply matrices by scalars, like $\frac{1}{n}$, using the traditional
multiplication operator *, whereas we multiply two matrices using the operator %*%:


# Using the Baseball dataset
y <- data$Height
print(mean(y))

## [1] 73.69729

n <- length(y)
Y <- matrix(y, n, 1)   # n x 1 column of observations
A <- matrix(1, n, 1)   # n x 1 column of ones
barY = t(A)%*%Y / n
print(barY)

##          [,1]
## [1,] 73.69729

# double-check the result
mean(data$Height)

## [1] 73.69729

Note: Multiplying the transpose of a matrix with another matrix is very common in
statistical modeling and computing. Thus, there is an R function for this operation,
crossprod():

barY = crossprod(A, Y) / n
print(barY)

##          [,1]
## [1,] 73.69729

Variance 

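For the variance, we note that if:

$$Y' = \begin{pmatrix} Y_1 - \bar{Y} \\ \vdots \\ Y_n - \bar{Y} \end{pmatrix}, \quad \text{then} \quad \frac{1}{n-1} Y'^T Y' = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

A crossprod with only one matrix input computes $Y'^T Y'$. Thus, to compute the
variance, we can simply type:


Y1 <- y - as.numeric(barY)   # barY is a 1 x 1 matrix; convert it to a scalar
crossprod(Y1)/(n-1)  # Y1.man <- (1/(n-1)) * t(Y1) %*% Y1

##          [,1]
## [1,] 5.316798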
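
The matrix formulas for the mean and the variance can be cross-checked against the built-in functions mean() and var(). The sketch below is an illustrative addition that uses simulated values instead of the baseball data, so that it runs without downloading anything; the seed, the sample size, and all object names (y_sim, ones, mean_mat, var_mat, and so on) are arbitrary choices made for this example.

```r
set.seed(1234)
y_sim <- rnorm(50, mean = 74, sd = 2.3)   # simulated "heights"
n_sim <- length(y_sim)

Y_sim <- matrix(y_sim, n_sim, 1)          # n x 1 column of observations
ones  <- matrix(1, n_sim, 1)              # n x 1 column of ones

# Sample mean as (1/n) * A^T Y
mean_mat <- as.numeric(crossprod(ones, Y_sim) / n_sim)

# Sample variance as (1/(n-1)) * Y'^T Y' with centered values Y'
resid_sim <- Y_sim - mean_mat
var_mat <- as.numeric(crossprod(resid_sim) / (n_sim - 1))

all.equal(mean_mat, mean(y_sim))
all.equal(var_mat, var(y_sim))
```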


Applications  of  Matrix  Algebra:  Linear  Modeling 

Let’s  use  these  matrices: 


(Y A 

<  1 

Xi  \ 

y2 

• 

• 

• 

,x  = 

1 

*2 

,P  = 

^0^ 

|  and  e  = 

£2 

• 

• 

• 

\YnJ 

VI 

%n  ) 

V  £n  / 

Then,  we  can  write  a  simple  linear  model: 


5.5  Scalars,  Vectors  and  Matrices 
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y i  —  Po  +  Xi  -\-  £i,i  —  1,  . . . ,  n 
as: 

fYi\ 

r2 

UJ 

or  simply: 

Y  =  Xp  +  e, 

which  is  a  brief  way  to  write  the  same  model  equation. 

The optimal solution is achieved when all residuals ($\varepsilon_i$) are as small as possible
(indicating a good model fit). This corresponds to the least squares (LS) solution to
this matrix equation ($Y = X\beta + \varepsilon$), which can be obtained by minimizing the residual
square error:

$$\langle \varepsilon, \varepsilon \rangle = \varepsilon^T \varepsilon = (Y - X\beta)^T (Y - X\beta).$$

This quantity is the cross-product of the residual vector with itself. We can determine
the values of $\beta$ by minimizing this expression, using calculus to find the minimum
of the cost (objective) function. More about optimization is in Chap. 22.


Finding  Function  Extrema  (Min/Max)  Using  Calculus 

There are a series of rules that permit us to solve partial derivative equations in
matrix notation. By setting the derivative of a cost function to zero and solving for
the unknown parameter $\beta$, we obtain candidate solution(s). Setting the derivative of the
above expression to zero yields:

$$-2X^T (Y - X\hat{\beta}) = 0$$

$$X^T X \hat{\beta} = X^T Y$$

$$\hat{\beta} = (X^T X)^{-1} X^T Y,$$

which represents the desired solution. Hat notation ($\hat{\ }$) is used to denote estimates.
For instance, the solution for the unknown $\beta$ parameters is denoted by the (data-driven)
estimate $\hat{\beta}$.
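The derivative used here follows from expanding the quadratic cost. The short derivation below is a sketch that assumes $X^T X$ is invertible; the symbol $S(\beta)$ for the cost function is introduced only for this illustration:

```latex
\begin{aligned}
S(\beta) &= (Y - X\beta)^T (Y - X\beta)
          = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta, \\
\frac{\partial S}{\partial \beta} &= -2 X^T Y + 2 X^T X \beta
          = -2 X^T (Y - X\beta), \\
\left.\frac{\partial S}{\partial \beta}\right|_{\beta = \hat{\beta}} = 0
  &\;\Longrightarrow\; X^T X \hat{\beta} = X^T Y
  \;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T Y .
\end{aligned}
```

Since the second derivative, $2X^T X$, is positive semi-definite, the candidate solution is a minimum rather than a maximum.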




The least squares minimization works because minimizing a smooth function corresponds
to finding the roots of its (first) derivative. In ordinary least squares
(OLS), we square the residuals:

$$(Y - X\beta)^T (Y - X\beta).$$

Notice that, for a non-negative function $f$, the minima of $f(x)$ and $f^2(x)$ are achieved
at the same roots of $f'(x)$, as the derivative of $f^2(x)$ is $2 f(x) f'(x)$.


5.5.2  Least  Square  Estimation 


# x = cbind(data$Height, data$Age)
x = data$Height
y = data$Weight
X <- cbind(1, x)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
### or
beta_hat <- solve(crossprod(X)) %*% crossprod(X, y)


Now we can see the results of this by computing the estimated $\hat{\beta}_0 + \hat{\beta}_1 x$ for any
value of $x$ (Fig. 5.1):

newx <- seq(min(x), max(x), len=100)
X <- cbind(1, newx)
fitted <- X %*% beta_hat
plot(x, y, xlab="MLB Player's Height", ylab="Player's Weight")
lines(newx, fitted, col=2)

$\hat{\beta} = (X^T X)^{-1} X^T Y$ is one of the most widely used results in data analytics. One
of the advantages of this approach is that we can use it in many different situations.


Fig. 5.1 A linear model of player's weight as a function of their height overlaid on the paired scatterplot


The R lm Function

R  has  a  very  convenient  function  that  fits  these  models.  We  will  learn  more  about  this 
function  later,  but  here  is  a  preview: 


# X <- cbind(data$Height, data$Age)   # more complicated model
X <- data$Height   # simple model
y <- data$Weight
fit <- lm(y ~ X)
fit

Note  that  we  obtain  the  same  estimates  of  the  solution  using  either  the  built-in 
lm()  function  or  using  first-principles. 
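The agreement between the two approaches can be checked programmatically. The sketch below uses simulated height and weight values (the coefficients, noise level, and seed are assumptions chosen only for illustration) and compares the first-principles estimate with coef() of the lm() fit:

```r
set.seed(3)
height <- rnorm(100, mean = 73.7, sd = 2.3)            # hypothetical predictor
weight <- -150 + 4.8 * height + rnorm(100, sd = 15)    # hypothetical response

X <- cbind(1, height)                                      # design matrix with intercept column
beta_hat <- solve(crossprod(X)) %*% crossprod(X, weight)   # (X^T X)^{-1} X^T y

fit <- lm(weight ~ height)                                 # built-in linear model fit

# Compare the two sets of estimates up to a small numerical tolerance
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)), tolerance = 1e-6)
```

The tolerance argument allows for small floating-point differences, since lm() solves the problem through a QR decomposition rather than by inverting $X^T X$.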


5.6  Eigenvalues  and  Eigenvectors 

Eigen-spectrum decomposition of linear operators (matrices) into eigenvalues and
eigenvectors enables us to easily understand linear transformations. The eigenvectors
represent the “axes” (directions) along which a linear transformation acts
by stretching, compressing, or flipping. The eigenvalues represent the amounts of
this linear transformation along the specified eigenvector direction. In higher dimensions,
there are more directions along which we need to understand the behavior of
the linear transformation. The eigen-spectrum makes it easier to understand the
linear transformation, especially when many (perhaps all) of the eigenvectors are
linearly independent (or even orthogonal).

For a matrix A, if we have $A v = \lambda v$, then we say v (a non-zero vector) is a right
eigenvector of the matrix A, and the scale factor $\lambda$ is the eigenvalue corresponding to
that eigenvector.

With some calculations, we can show that $A v = \lambda v$ is the same as $(\lambda I_n - A) v = 0$.
Here $I_n$ is the $n \times n$ identity matrix. So, when we solve this equation, we get our
eigenvalues and eigenvectors. Of course, as this is a very common operation, we
don't need to do that by hand; the eigen() function in R helps us with this
calculation.

m11 <- diag(nrow = 2, ncol = 2)
m11
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

eigen(m11)
## $values
## [1] 1 1
##
## $vectors
##      [,1] [,2]
## [1,]    0   -1
## [2,]    1    0

We can use R to verify that $(\lambda I_n - A) v = 0$.

(eigen(m11)$values*diag(2)-m11)%*%eigen(m11)$vectors
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0

As we mentioned earlier, diag(n) creates an $n \times n$ identity matrix. Thus, diag(2)
is the $I_2$ matrix in the equation. Because both eigenvalues are equal to 1 here, the
elementwise product eigen(m11)$values*diag(2) coincides with $\lambda I_2$. The output zero
matrix verifies that the equation $(\lambda I_n - A) v = 0$ holds true.
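For a matrix with distinct eigenvalues, each eigenpair has to be checked against its own eigenvalue. The following sketch uses a small symmetric matrix chosen only for illustration (it is an assumption, not part of the original example) and verifies $A v = \lambda v$ column by column:

```r
A <- matrix(c(2, 1, 1, 2), nrow = 2)   # symmetric example matrix (hypothetical)
e <- eigen(A)
lambda <- e$values     # eigenvalues, sorted in decreasing order
V <- e$vectors         # eigenvectors stored as columns

# Check A v = lambda v separately for each eigenpair
for (k in seq_along(lambda)) {
  lhs <- A %*% V[, k]
  rhs <- lambda[k] * V[, k]
  print(all.equal(as.numeric(lhs), as.numeric(rhs)))
}

# Equivalent check for all pairs at once: A V - V diag(lambda) should be (numerically) zero
residual <- A %*% V - V %*% diag(lambda)
max(abs(residual))
```

Note that V %*% diag(lambda) scales each eigenvector (column) by its own eigenvalue, which is what the eigen-equation requires.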

Some of the many interesting applications of the eigen-spectrum are shown below.


5.7  Other  Important  Functions 

Other useful matrix operations are listed in Table 5.2.


Table 5.2 Other matrix operators and operands

Function                     Math expression or explanation
crossprod(A, B)              $A^T B$, where A and B are matrices
y <- svd(A)                  The singular value decomposition; the output has the following components:
  y$d                        Vector containing the singular values of A
  y$u                        Matrix whose columns contain the left singular vectors of A
  y$v                        Matrix whose columns contain the right singular vectors of A
k <- qr(A)                   The QR decomposition; the output has the following components:
  k$qr                       Has an upper triangle that contains the R of the decomposition and a lower triangle that contains information on the Q of the decomposition
  k$rank                     The rank of A
  k$qraux                    A vector which contains additional information on Q
  k$pivot                    Contains information on the pivoting strategy used
rowMeans(A) / colMeans(A)    Returns a vector of row/column means
rowSums(A) / colSums(A)      Returns a vector of row/column sums


5.8 Matrix Notation (Another View)

Some flexible matrix operations can help us save time calculating row or column
averages. For example, column averages can be calculated by the following matrix
operation.

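$$AX = \begin{pmatrix} \frac{1}{N} & \frac{1}{N} & \cdots & \frac{1}{N} \end{pmatrix}
\begin{pmatrix} X_{1,1} & \cdots & X_{1,p} \\ X_{2,1} & \cdots & X_{2,p} \\ \vdots & \ddots & \vdots \\ X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
= \begin{pmatrix} \bar{X}_{\cdot 1} & \bar{X}_{\cdot 2} & \cdots & \bar{X}_{\cdot p} \end{pmatrix},$$

where $\bar{X}_{\cdot j}$ is the average of column $j$. Row averages can be calculated by the next operation:

$$XB = \begin{pmatrix} X_{1,1} & \cdots & X_{1,p} \\ X_{2,1} & \cdots & X_{2,p} \\ \vdots & \ddots & \vdots \\ X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
\begin{pmatrix} \frac{1}{p} \\ \frac{1}{p} \\ \vdots \\ \frac{1}{p} \end{pmatrix}
= \begin{pmatrix} \bar{X}_{1 \cdot} \\ \bar{X}_{2 \cdot} \\ \vdots \\ \bar{X}_{N \cdot} \end{pmatrix},$$

where $\bar{X}_{i \cdot}$ is the average of row $i$.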

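We see that fast calculations can be done by multiplying the original feature matrix
by a suitable matrix on the left or on the right. In general, multiplying by a row vector
on the left gives us the following equation:

$$AX = \begin{pmatrix} a_1 & a_2 & \cdots & a_N \end{pmatrix}
\begin{pmatrix} X_{1,1} & \cdots & X_{1,p} \\ X_{2,1} & \cdots & X_{2,p} \\ \vdots & \ddots & \vdots \\ X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^{N} a_i X_{i,1} & \sum_{i=1}^{N} a_i X_{i,2} & \cdots & \sum_{i=1}^{N} a_i X_{i,p} \end{pmatrix}.$$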
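The three matrix products above can be tried on a toy matrix. This is a minimal sketch; the 4 x 3 matrix and the weight vector a are assumptions used only for illustration:

```r
X <- matrix(1:12, nrow = 4, ncol = 3)   # toy feature matrix: N = 4 rows, p = 3 columns
N <- nrow(X); p <- ncol(X)

A <- matrix(1/N, nrow = 1, ncol = N)    # 1 x N row vector of weights 1/N
B <- matrix(1/p, nrow = p, ncol = 1)    # p x 1 column vector of weights 1/p

col_avg <- A %*% X    # 1 x p matrix of column averages
row_avg <- X %*% B    # N x 1 matrix of row averages

all.equal(as.numeric(col_avg), colMeans(X))
all.equal(as.numeric(row_avg), rowMeans(X))

# General weights: (a_1, ..., a_N) X gives weighted column sums
a <- c(1, -1, 2, 0)
weighted <- matrix(a, nrow = 1) %*% X
all.equal(as.numeric(weighted), colSums(a * X))   # a * X scales row i by a[i]
```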

Now let's do an example to practice matrix notation. We will use gene
expression data including 8,793 different genes and 208 subjects. These gene
expression data represent a microarray experiment (GSE5859) comparing gene
expression profiles from lymphoblastoid cells. Specifically, the data compare the
expression level of genes in lymphoblasts from individuals in three HapMap
populations {CEU, CHB, JPT}. The study found that more than 1,000 genes were
significantly different ($\alpha < 0.05$) in mean expression level between the {CEU} and
{CHB + JPT} samples.

The gene expression profiles data has two components:

•  The gene expression intensities (exprs_GSE5859.csv), where the rows represent
features on the microarray (e.g., genes), and columns represent different microarray
samples,

•  Meta-data about each of the samples (exprs_MetaData_GSE5859.csv), where
rows represent samples, and columns represent meta-data (e.g., sex, age, treatment
status, the date of the sample processing).

gene <- read.csv("https://umich.instructure.com/files/2001417/download?download_frd=1", header = T)   # exprs_GSE5859.csv
info <- read.csv("https://umich.instructure.com/files/2001418/download?download_frd=1", header = T)   # exprs_MetaData_GSE5859.csv

Like the lapply() function that we will discuss in Chap. 7, the sapply()
function can be used to calculate column and row averages. Let's compare the output
of sapply() with first-principles matrix calculations.

colmeans <- sapply(gene[, -1], mean)
gene1 <- as.matrix(gene[, -1])
# can also use built-in functions
# colMeans <- colMeans(gene1)
colmeans.matrix <- crossprod(rep(1/nrow(gene1), nrow(gene1)), gene1)
colmeans[1:15]
##  GSM25581.CEL.gz  GSM25681.CEL.gz GSM136524.CEL.gz GSM136707.CEL.gz
##         5.703998         5.721779         5.726300         5.743632
##  GSM25553.CEL.gz GSM136676.CEL.gz GSM136711.CEL.gz GSM136542.CEL.gz
##         5.835499         5.742565         5.751601         5.732211
## GSM136535.CEL.gz  GSM25399.CEL.gz  GSM25552.CEL.gz  GSM25542.CEL.gz
##         5.741741         5.618825         5.805147         5.733117
## GSM136544.CEL.gz  GSM25662.CEL.gz GSM136563.CEL.gz
##         5.733175         5.716855         5.750600

colmeans.matrix[1:15]
##  [1] 5.703998 5.721779 5.726300 5.743632 5.835499 5.742565 5.751601
##  [8] 5.732211 5.741741 5.618825 5.805147 5.733117 5.733175 5.716855
## [15] 5.750600

The same output is reported. Here, we use rep(1/nrow(gene1), nrow(gene1))
to create the vector

$$\left( \frac{1}{N}, \frac{1}{N}, \ldots, \frac{1}{N} \right)$$

needed to obtain the column averages. We may visualize the column means using a
histogram (Fig. 5.2).

colmeans <- as.matrix(colmeans)
hist(colmeans)

The  histogram  shows  that  the  distribution  is  mostly  symmetric  and  bell  shaped. 
We  can  address  harder  problems  using  matrix  notation.  For  example,  let’s  calculate 
the  differences  between  genders  for  each  gene.  First,  we  need  to  get  the  gender 
information  for  each  subject. 

gender <- info[, c(3, 4)]
rownames(gender) <- gender$filename

Then, we have to reorder the rows of gender to make them consistent with the
columns of the feature matrix gene1.


gender <- gender[colnames(gene1), ]

Fig. 5.2 Histogram of the column means of the gene expression data

After that, we will construct the design matrix and multiply the feature matrix
by it. The plan is to multiply the following two matrices:

$$\begin{pmatrix} X_{1,1} & \cdots & X_{1,p} \\ X_{2,1} & \cdots & X_{2,p} \\ \vdots & \ddots & \vdots \\ X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
\begin{pmatrix} \frac{1}{p} & a_1 \\ \frac{1}{p} & a_2 \\ \vdots & \vdots \\ \frac{1}{p} & a_p \end{pmatrix}
= \begin{pmatrix} \bar{X}_1 & gender.diff_1 \\ \bar{X}_2 & gender.diff_2 \\ \vdots & \vdots \\ \bar{X}_N & gender.diff_N \end{pmatrix},$$

where $a_i = -1/N_F$ if the $i$th subject is female and $a_i = 1/N_M$ if the subject is male
($N_F$ and $N_M$ are the numbers of female and male subjects). Thus, we give each
female and each male the same weight within their group before the subtraction: we
average each gender and take the difference. $\bar{X}_i$ represents the average across both
genders and $gender.diff_i$ represents the gender difference for the $i$th gene.
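The weighting scheme can be sketched on a small simulated example. The expression values, the sex labels, and the names expr, W, and res below are all assumptions made for illustration; they are not part of the GSE5859 data:

```r
set.seed(4)
n_genes <- 5; n_subj <- 8
expr <- matrix(rnorm(n_genes * n_subj, mean = 6), nrow = n_genes)   # genes in rows, subjects in columns
sex  <- c("F", "M", "F", "M", "M", "F", "M", "M")                   # hypothetical subject labels

n_F <- sum(sex == "F"); n_M <- sum(sex == "M")
a <- ifelse(sex == "F", -1 / n_F, 1 / n_M)    # -1/N_F for females, 1/N_M for males

# Two-column weight matrix: overall average and male-minus-female contrast
W <- cbind(rowavg = rep(1 / n_subj, n_subj), gender.diff = a)
res <- expr %*% W                             # one row per gene

# Direct computation of the same quantities, for comparison
direct_diff <- rowMeans(expr[, sex == "M"]) - rowMeans(expr[, sex == "F"])
all.equal(as.numeric(res[, "gender.diff"]), direct_diff)
all.equal(as.numeric(res[, "rowavg"]), rowMeans(expr))
```

With these signs, the second column is the mean male expression minus the mean female expression for each gene.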


table(gender$sex)
##
##   F   M
##  86 122

gender$vector <- ifelse(gender$sex=="F", -1/86, 1/122)
vec1 <- as.matrix(data.frame(rowavg=rep(1/ncol(gene1), ncol(gene1)),
                             gender.diff=gender$vector))
gender.matrix <- gene1 %*% vec1
gender.matrix[1:15, ]
##         rowavg  gender.diff
##  [1,] 6.383263 -0.003209464
##  [2,] 7.091630 -0.031320597
##  [3,] 5.477032  0.064806978
##  [4,] 7.584042 -0.001300152
##  [5,] 3.197687  0.015265502
##  [6,] 7.338204  0.078434938
##  [7,] 4.232132  0.008437864
##  [8,] 3.716460  0.018235650
##  [9,] 2.810554 -0.038698101
## [10,] 5.208787  0.020219666
## [11,] 6.498989  0.025979654
## [12,] 5.292992 -0.029988980
## [13,] 7.069081  0.038575442
## [14,] 5.952406  0.030352616
## [15,] 7.247116  0.046020066


5.9 Multivariate Linear Regression

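As we mentioned earlier, the formula for multivariate linear regression can be
written as

$$Y_i = \beta_0 + X_{i,1}\beta_1 + \cdots + X_{i,p}\beta_p + \varepsilon_i, \quad i = 1, \ldots, N.$$

We can rewrite this in matrix form:

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_0 +
\begin{pmatrix} X_{1,1} \\ X_{2,1} \\ \vdots \\ X_{N,1} \end{pmatrix} \beta_1 + \cdots +
\begin{pmatrix} X_{1,p} \\ X_{2,p} \\ \vdots \\ X_{N,p} \end{pmatrix} \beta_p +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix},$$

which is the same as $Y = X\beta + \varepsilon$, or

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix} =
\begin{pmatrix} 1 & X_{1,1} & \cdots & X_{1,p} \\ 1 & X_{2,1} & \cdots & X_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{N,1} & \cdots & X_{N,p} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix}.$$

$Y = X\beta + \varepsilon$ implies that $X^T Y \sim X^T (X\beta) = (X^T X)\beta$, and thus the solution for $\beta$ is
obtained by multiplying both sides by $(X^T X)^{-1}$:

$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$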

Direct matrix calculation can be faster than calling a general model-fitting function. Let's apply
this to the Lahman baseball data representing yearly stats and standings. Let's
download the baseball data (https://umich.instructure.com/files/2018445/download?download_frd=1)
and put it in the R working directory. We can use the load()
function to load a local RData object. For this example, we subset the dataset by
G==162 and yearID<2002. Also, we create a new feature named Singles that
is equal to H (Hits by batters) - X2B (Doubles) - X3B (Triples) - HR
(Homeruns by batters). Finally, we only pick some features: R (Runs scored),
Singles, HR (Homeruns by batters), and BB (Walks by batters).


# If you downloaded the .RData locally first, then you can easily load it
# into the R workspace by:
# load("Teams.RData")

# Alternatively, you can also download the data in CSV format from
# http://umich.instructure.com/courses/38100/files/folder/data (teamsData.csv)
Teams <- read.csv('https://umich.instructure.com/files/2798317/download?download_frd=1', header=T)

dat <- Teams[Teams$G==162 & Teams$yearID<2002, ]
dat$Singles <- dat$H - dat$X2B - dat$X3B - dat$HR
dat <- dat[, c("R", "Singles", "HR", "BB")]
head(dat)
##        R Singles  HR  BB
## 439  505     997  11 344
## 1367 683     989  90 580
## 1368 744     902 189 681
## 1378 652     948 156 516
## 1380 707    1017  92 620
## 1381 632    1020 126 504

Now let's do a simple example. We will use runs scored (R) as the response
variable and walks by batters (BB) as the independent variable. Also, we need to add a
column of 1's to the X matrix.


Y <- dat$R
X <- cbind(rep(1, nrow(dat)), dat$BB)   # first column of 1's for the intercept
X[1:10, ]
##       [,1] [,2]
##  [1,]    1  344
##  [2,]    1  580
##  [3,]    1  681
##  [4,]    1  516
##  [5,]    1  620
##  [6,]    1  504
##  [7,]    1  498
##  [8,]    1  502
##  [9,]    1  493
## [10,]    1  556


Let's solve for the effect sizes (the beta coefficients) by

$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$

beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##             [,1]
## [1,] 326.8241628
## [2,]   0.7126402
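Forming the explicit inverse is convenient for exposition, but the same estimate can be obtained without inverting $X^T X$. The sketch below, on simulated data (the variable names bb and runs, the coefficients, and the seed are assumptions for illustration), compares the explicit-inverse formula with solving the normal equations directly and with a QR-based least squares solve:

```r
set.seed(5)
n <- 200
bb   <- rnorm(n, mean = 550, sd = 60)            # hypothetical walks
runs <- 300 + 0.7 * bb + rnorm(n, sd = 70)       # hypothetical runs scored

X <- cbind(1, bb)                                # design matrix with intercept column

beta_inv   <- solve(t(X) %*% X) %*% t(X) %*% runs       # explicit inverse, as in the text
beta_solve <- solve(crossprod(X), crossprod(X, runs))   # solve the normal equations directly
beta_qr    <- qr.solve(X, runs)                         # least squares via QR decomposition

all.equal(as.numeric(beta_inv), as.numeric(beta_solve), tolerance = 1e-6)
all.equal(as.numeric(beta_inv), as.numeric(beta_qr),    tolerance = 1e-6)
```

All three give the same estimates up to rounding; avoiding the explicit inverse is generally the numerically safer choice when the columns of X are strongly correlated or badly scaled.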


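To examine this manual calculation, we refit the linear equation using the lm()
function. Comparing the computation times with system.time() shows that, for a dataset
of this size, both approaches complete almost instantly; the direct matrix calculation
skips the additional bookkeeping that lm() performs, so it may be more time-efficient
on larger problems.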


fit <- lm(R~BB, data=dat)
# fit <- lm(R~., data=dat)
# '.' indicates all other variables, very useful when fitting models
# with many predictors
fit
##
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Coefficients:
## (Intercept)           BB
##    326.8242       0.7126

summary(fit)
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -187.788  -53.977   -2.995   55.649  258.614
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 326.82416   22.44340   14.56   <2e-16 ***
## BB            0.71264    0.04157   17.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.95 on 661 degrees of freedom
## Multiple R-squared:  0.3078, Adjusted R-squared:  0.3068
## F-statistic:   294 on 1 and 661 DF,  p-value: < 2.2e-16

system.time(fit <- lm(R~BB, data=dat))
##    user  system elapsed
##       0       0       0

system.time(beta <- solve(t(X)%*%X)%*%t(X)%*%Y)
##    user  system elapsed
##       0       0       0

We can visualize the relationship between R and BB by drawing a scatter plot
(Fig. 5.3).

plot(dat$BB, dat$R, xlab = "BB", ylab = "R", main = "Scatter plot/regression for baseball data")
abline(beta[1, 1], beta[2, 1], lwd=4, col="red")

Fig. 5.3 Scatterplot and model of walks (BB) and runs (R) by batters, using the MLB dataset

In Fig. 5.3, the red line is our regression line calculated by matrix calculation.
Matrix calculation can still work if we have multiple independent variables. Next, we
will add another variable, HR, to the model (Fig. 5.4).

X <- cbind(rep(1, nrow(dat)), dat$BB, dat$HR)
beta <- solve(t(X)%*%X)%*%t(X)%*%Y
beta
##             [,1]
## [1,] 287.7226756
## [2,]   0.3897178
## [3,]   1.5220448

# install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(dat$BB, dat$HR, dat$R)

Fig. 5.4 3D Scatterplot of walks (BB), homeruns (HR), and runs (R) by batters, using the baseball dataset


5.10  Sample  Covariance  Matrix 

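We can also obtain the covariance matrix for our features using matrix operations.
Suppose

$$X_{N \times K} = \begin{pmatrix} X_{1,1} & \cdots & X_{1,K} \\ X_{2,1} & \cdots & X_{2,K} \\ \vdots & \ddots & \vdots \\ X_{N,1} & \cdots & X_{N,K} \end{pmatrix} = [X_1, X_2, \ldots, X_K],$$

where the column $X_i$ holds the $N$ observations of the $i$th feature. Then the covariance matrix is:

$$\Sigma = (\Sigma_{i,j}),$$

where $\Sigma_{i,j} = Cov(X_i, X_j) = E\left((X_i - \mu_i)(X_j - \mu_j)\right)$, $1 \le i, j \le K$.
The sample covariance matrix is:

$$\hat{\Sigma}_{i,j} = \frac{1}{N-1} \sum_{m=1}^{N} (x_{m,i} - \bar{x}_i)(x_{m,j} - \bar{x}_j),$$

where

$$\bar{x}_i = \frac{1}{N} \sum_{m=1}^{N} x_{m,i}, \quad i = 1, \ldots, K.$$

In general,

$$\hat{\Sigma} = \frac{1}{N-1} (X - \bar{X})^T (X - \bar{X}).$$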

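Assume we want to get the sample covariance matrix of the following $5 \times 3$
feature matrix x.

x <- matrix(c(4.0, 4.2, 3.9, 4.3, 4.1,
              2.0, 2.1, 2.0, 2.1, 2.2,
              0.60, 0.59, 0.58, 0.62, 0.63), ncol=3)
x
##      [,1] [,2] [,3]
## [1,]  4.0  2.0 0.60
## [2,]  4.2  2.1 0.59
## [3,]  3.9  2.0 0.58
## [4,]  4.3  2.1 0.62
## [5,]  4.1  2.2 0.63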

Notice that we have three features and five observations in this matrix. Let's get
the column means first.

vec2 <- matrix(c(1/5, 1/5, 1/5, 1/5, 1/5), ncol=5)
# column means
x.bar <- vec2 %*% x
x.bar
##      [,1] [,2]  [,3]
## [1,]  4.1 2.08 0.604

Then, we repeat each column mean 5 times to match the layout of the feature matrix.
Finally, we are able to plug everything into the formula above.

x.bar <- matrix(rep(x.bar, each=5), nrow=5)
S <- 1/4 * t(x-x.bar) %*% (x-x.bar)
S
##         [,1]    [,2]    [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043

In the covariance matrix, S[i, i] is the variance of the ith feature and S[i, j]
is the covariance of the ith and jth features.

Compare this to the automated calculation of the variance-covariance matrix.

autoCov <- cov(x)
autoCov
##         [,1]    [,2]    [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043
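The same steps can be wrapped into a small helper that works for any numeric matrix. This is a sketch only: the function name sample_cov and the simulated input are assumptions introduced for illustration:

```r
# Sample covariance from first principles: center the columns, then cross-multiply
sample_cov <- function(x) {
  n <- nrow(x)
  ones <- matrix(1, n, 1)                     # n x 1 vector of ones
  col_means <- crossprod(ones, x) / n         # 1 x K row of column means
  x_centered <- x - ones %*% col_means        # subtract each column mean from its column
  crossprod(x_centered) / (n - 1)             # (X - Xbar)^T (X - Xbar) / (n - 1)
}

set.seed(6)
x_sim <- matrix(rnorm(40), nrow = 10, ncol = 4)   # hypothetical 10 x 4 feature matrix

all.equal(sample_cov(x_sim), cov(x_sim))          # agrees with the built-in cov()
```

Here ones %*% col_means replaces the rep(..., each = 5) construction with a product that adapts to the number of rows.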


5.11  Assignments:  5.  Linear  Algebra  &  Matrix  Computing 

5.11.1  How  Is  Matrix  Multiplication  Defined? 

Validate that $(A_{k,n} \times B_{n,m})^T = B_{n,m}^T \times A_{k,n}^T$ by using math notation, as well as
by using R functions.


5.11.2  Scalar  Versus  Matrix  Multiplication 

Demonstrate the differences between the scalar multiplication (*) and matrix
multiplication (%*%) for numbers, vectors, and matrices (second-order tensors).


5.11.3  Matrix  Equations 

Write a simple matrix solver ($b = Ax$, i.e., $x = A^{-1} b$) and validate its accuracy using
the R command solve(A, b). Solve this equation:

$$\begin{aligned} 2a - b + 2c &= 5 \\ -a - 2b + c &= 3 \\ a + b - c &= 2 \end{aligned}$$


5.11.4  Least Square Estimation

Use the SOCR Knee Pain dataset, extract the RB = Right-Back locations (x, y),
and fit a linear model for vertical locations (y) in terms of the horizontal locations
(x). Display the linear model on top of the scatter plot of the paired data. Comment
on the model you obtain.


5.11.5  Matrix  Manipulation 

Create a matrix A with elements seq(1, 15, length = 6) and argument nrow = 3.
Then, add a row to this matrix and add two columns to A to obtain a matrix
$C_{4,4}$. Next, generate a diagonal matrix D with dim = 4 and elements
rnorm(4, 0, 1). Apply elementwise addition, subtraction, multiplication, and division
to the matrices C and D; apply matrix multiplication to D and C; obtain the inverse of
C and compare it with the generalized inverse, MASS::ginv().


5.11.6  Matrix  Transpose 

Validate the multiplication transposition formula, $(A_{k,n} \cdot B_{n,m})^T = B_{n,m}^T \cdot A_{k,n}^T$, by
using math notation, as well as computationally using R and some example matrices.
E.g., you can try

A = matrix(1:6, nrow=3); B = matrix(2:7, nrow = 2)


5.11.7  Sample  Statistics 

Use the SOCR Data Iris Sepal Petal Classes and extract the rows of setosa
flowers. Compute the sample mean and variance of each variable; then calculate the
sample covariance and correlation between sepal width and sepal height.


5.11.8  Least  Square  Estimation 

Use the SOCR Knee Pain dataset, extract the RB = Right-Back locations (x, y),
and fit a linear model for vertical location (y) in terms of the horizontal
location (x). Display the linear model on top of the scatter plot of the paired data.
Comment on the model you obtained.


5.11.9  Eigenvalues  and  Eigenvectors 


Generate a random matrix with A = matrix(rnorm(9, 0, 1), nrow = 3),
compute eigenvalues and eigenvectors for A; then try to solve this equation
$det(A - \lambda I) = 0$, where $\lambda$ is a vector of length 3. Compare $\lambda$ and the eigenvalues you
solved above.

Example of manual and automated calculations of eigen-spectra (eigenvalues and
eigenvectors):

# A <- matrix(rnorm(9, 0, 1), nrow = 3); A
# a random matrix may generate complex solutions, so we define a fixed example matrix
A <- matrix(c(0, 1/4, 1/4, 3/4, 0, 1/4, 1/4, 3/4, 1/2), 3, 3, byrow=T); A
eigen_spectrum <- eigen(A); eigen_spectrum
# compute the eigen-spectrum (eigenvalues, lambda, and eigenvectors, v),
# A %*% v = lambda * v
lambda <- eigen_spectrum$values; V <- eigen_spectrum$vectors
B <- A - lambda[1]*diag(3); B
# compute B = (A - lambda_1 * I) for the first eigenvalue
det(B)
# verify that det(A - lambda_1 * I) is (numerically) zero
A %*% V - V %*% diag(lambda)
# validate that A %*% v = lambda * v for all eigenpairs (expect a zero matrix)
all.equal(A, V %*% diag(lambda) %*% solve(V))   # compare A = V * Lambda * inv(V)
# note: this comparison requires V to be invertible (A diagonalizable)
all.equal(matrix(0, 3, 3), A %*% V - V %*% diag(lambda))
# The last line checks that A V - V Lambda is the zero matrix; mind the * and
# %*% scalar and matrix operators


Chapter 6

Dimensionality Reduction
Now that we have most of the fundamentals covered in the previous chapters, we can
delve into the first data analytic method, dimension reduction, which reduces the
number of features when modeling a very large number of variables. Dimension
reduction can help us extract a set of “uncorrelated” principal variables and reduce
the complexity of the data. We are not simply picking some of the original variables.
Rather, we are constructing new “uncorrelated” variables as functions of the old
features.

Dimensionality reduction techniques enable exploratory data analyses by reducing
the complexity of the dataset, while still approximately preserving important properties,
such as the distances between cases or subjects. If we are able to
reduce the complexity down to a few dimensions, we can then plot the data and
untangle its intrinsic characteristics.

We will (1) start with a synthetic example demonstrating the reduction of 2D
data into 1D; (2) explain the notion of rotation matrices; (3) show examples of
principal component analysis (PCA), singular value decomposition (SVD), independent
component analysis (ICA), and factor analysis (FA); and (4) present a
Parkinson's disease case-study at the end. The supplementary DSPA electronic
materials for this chapter also include the theory and practice of t-Distributed
Stochastic Neighbor Embedding (t-SNE), which represents high-dimensional data
via projections into non-linear low-dimensional manifolds.


6.1  Example: Reducing 2D to 1D

We consider an example looking at twin heights. Suppose we simulate 1000 2D
points that represent normalized individual heights, i.e., the number of standard
deviations from the mean height. Each 2D point represents a pair of twins. We will
simulate this scenario using a bivariate normal distribution (Table 6.1 and Fig. 6.1).

Table 6.1 Schematic data structure representation and indexing of twin heights

Twin 1 Height    Twin 2 Height
y[1,1]           y[1,2]
y[2,1]           y[2,2]
y[3,1]           y[3,2]
...              ...
y[500,1]         y[500,2]

Fig. 6.1 Scatterplot of paired twin heights. The red points show the heights of the first two pairs of twins

library(MASS)
set.seed(1234)
n <- 1000
y = t(mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))

$$y^T_{2 \times 500} = \begin{pmatrix} y[1,] = Twin1_{Height} \\ y[2,] = Twin2_{Height} \end{pmatrix}
\sim BVN \left( \mu = \begin{pmatrix} \mu_{Twin1_{Height}} \\ \mu_{Twin2_{Height}} \end{pmatrix},
\Sigma = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix} \right).$$

plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
     ylab="Twin 2 (standardized height)", xlim=c(-3, 3), ylim=c(-3, 3))
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)   # plot the first 2 points

These  data  may  represent  a  fraction  of  the  information  included  in  a  high- 
throughput  neuroimaging  genetics  study  of  twins,  like  the  pediatric  study  example 
shown  here  (http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI). 

Tracking the distances between any two samples can be accomplished using the
dist() function. For example, here is the distance between the two red points in
Fig. 6.1:


d = dist(t(y))
as.matrix(d)[1, 2]
## [1] 2.100187


To transform the 2D data to a simpler 1D plot, we can reduce the data to a 1D
matrix (vector) approximately preserving the distances between the 2D points.

The 2D plot shows the Euclidean distance between the pair of red points (Fig. 6.2).
The length of this line is the distance between the two points. In 2D, these lines tend
to go along the direction of the diagonal. If we rotate the plot so that the diagonal
is along the x-axis, we obtain Fig. 6.3:

z1 = (y[1, ] + y[2, ])/2   # the sum (or rather average)
z2 = (y[1, ] - y[2, ])     # the difference
z = rbind(z1, z2)          # matrix now same dimensions as y

thelim <- c(-3, 3)
# par(mar=c(1, 2))
# par(mfrow=c(2, 1))
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
     ylab="Twin 2 (standardized height)",
     xlim=thelim, ylim=thelim)
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)

par(mfrow=c(1, 1))
plot(z[1, ], z[2, ], xlim=thelim, ylim=thelim, xlab="Average height",
     ylab="Difference in height")
points(z[1, 1:2], z[2, 1:2], col=2, pch=16)

Fig. 6.2 Scatterplots of the raw twin heights

Fig. 6.3 Scatterplots of the transformed twin heights, compare to Fig. 6.2

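Of course, matrix linear algebra notation can be used to represent this linear (affine)
transformation of the data. Here we can see that to get z we multiplied y by the
matrix:

$$A = \begin{pmatrix} 1/2 & 1/2 \\ 1 & -1 \end{pmatrix} \quad \Longrightarrow \quad z = A \times y.$$

We can invert this transform by multiplying the result by the inverse matrix
$A^{-1}$ as follows:

$$A^{-1} = \begin{pmatrix} 1 & 1/2 \\ 1 & -1/2 \end{pmatrix} \quad \Longrightarrow \quad y = A^{-1} \times z.$$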

You  can  try  this  in  R: 


A <- matrix(c(1/2, 1, 1/2, -1), nrow=2, ncol=2); A   # define a matrix
##      [,1] [,2]
## [1,]  0.5  0.5
## [2,]  1.0 -1.0

A_inv <- solve(A); A_inv   # inverse
##      [,1] [,2]
## [1,]    1  0.5
## [2,]    1 -0.5

A %*% A_inv   # Verify result
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Note that this matrix transformation did not preserve distances, i.e., the matrix
A is not a simple rotation in 2D:

d = dist(t(y)); as.matrix(d)[1, 2]     # distance between first two points of Y
## [1] 2.100187

d1 = dist(t(z)); as.matrix(d1)[1, 2]   # distance between first two points of Z = A*Y
## [1] 1.541323


6.2  Matrix Rotations

One important question to ask is how we can identify transformations that preserve
distances. In mathematics, transformations between metric spaces that are distance-preserving
are called isometries (or congruences or congruent transformations).
First, let's test the MA transformation we used above (Fig. 6.3):

$$M = Y_1 - Y_2, \quad A = \frac{Y_1 + Y_2}{2}.$$

MA <- matrix(c(1/2, 1, 1/2, -1), 2, 2)
MA_z <- MA %*% y
d <- dist(t(y))
d_MA <- dist(t(MA_z))
plot(as.numeric(d), as.numeric(d_MA))
abline(0, 1, col=2)


Observe that this MA transformation is not an isometry, as the distances are not
preserved (Fig. 6.4). Here is one example with

$$v_1 = \begin{pmatrix} v_{1x} = 0 \\ v_{1y} = 1 \end{pmatrix}, \quad v_2 = \begin{pmatrix} v_{2x} = 1 \\ v_{2y} = 0 \end{pmatrix},$$

which are distance $\sqrt{2}$ apart in their native space, but separated further by the transformation
MA, $d(MA(v_1), MA(v_2)) = 2$.

Fig. 6.4 The above MA transformation is not an isometry. This scatterplot shows that the twin-pair distances in the native space (x-axis) are not preserved after the transformation (y-axis)

MA; t(MA); solve(MA); t(MA) - solve(MA)
##      [,1] [,2]
## [1,]  0.5  0.5
## [2,]  1.0 -1.0
##      [,1] [,2]
## [1,]  0.5    1
## [2,]  0.5   -1
##      [,1] [,2]
## [1,]    1  0.5
## [2,]    1 -0.5
##      [,1] [,2]
## [1,] -0.5  0.5
## [2,] -0.5 -0.5

v1 <- c(0, 1); v2 <- c(1, 0); rbind(v1, v2)
##    [,1] [,2]
## v1    0    1
## v2    1    0

euc.dist <- function(x1, x2) sqrt(sum((x1 - x2)^2))
euc.dist(v1, v2)
## [1] 1.414214

v1_t <- MA %*% v1; v2_t <- MA %*% v2
euc.dist(v1_t, v2_t)
## [1] 2

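More generally, if $Y \sim BVN(\mu, \Sigma)$, then

$$Z = AY + \eta \sim BVN(\eta + A\mu, A \Sigma A^T),$$

where BVN denotes the bivariate normal distribution (see http://socr.umich.edu/HTML5/BivariateNormal/), $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is a 2 x 2 transformation matrix, $Y = (Y_1, Y_2)^T$, $\mu = (\mu_1, \mu_2)^T$ is the mean vector, and $\Sigma$ is the 2 x 2 variance-covariance matrix of $Y$.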

You can verify this by using the change of variable theorem. Thus, affine transformations preserve bivariate normality. However, in general, an affine transformation is not guaranteed to be an isometry.

The  question  now  is:  Under  what  additional  conditions  for  a  transformation 
matrix  A,  can  we  guarantee  an  isometry? 


Notice that,

$$d^2(P_i, P_j) = \sum_{k=1}^{T} (P_{i,k} - P_{j,k})^2 = \|P\|^2 = P^T P,$$

where $P = (P_{i,1} - P_{j,1}, \ldots, P_{i,T} - P_{j,T})^T$, and $P_i$ and $P_j$ are any two points in $T$ dimensions.

Thus, the only requirement we need is $(AY)^T (AY) = Y^T Y$, i.e., $A^T A = I$, which implies that $A$ is an orthogonal (rotational) matrix.

Let’s use a two-dimensional orthogonal matrix to illustrate this concept. Set $A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$. It’s easy to verify that $A$ is an orthogonal (2D rotation) matrix.

The  simplest  way  to  test  the  isometry  is  to  perform  the  linear  transformation 
directly  (Fig.  6.5). 


A <- 1/sqrt(2)*matrix(c(1, 1, 1, -1), 2, 2)
z <- A %*% y
d <- dist(t(y))
d2 <- dist(t(z))

plot(as.numeric(d), as.numeric(d2))
abline(0, 1, col=2)

We  can  observe  that  the  distances  computed  using  the  original  data  are  preserved 
after  the  transformation.  This  transformation  is  called  a  rotation  (isometry)  of  y.  Note 
the  difference  compared  to  the  earlier  plot,  Fig.  6.4. 
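The orthogonality criterion can also be checked numerically rather than visually. The following is a minimal sketch under stated assumptions: `is_orthogonal` is a hypothetical helper (not part of the original code), and the 2 x 50 matrix `Y` is simulated here as a stand-in for the twin-height data. It tests whether $A^T A = I$ for both matrices used above and reports the largest change in any pairwise distance.

```r
# Hypothetical helper: TRUE when t(A) %*% A equals the identity up to rounding
is_orthogonal <- function(A, tol = 1e-10) {
  max(abs(crossprod(A) - diag(ncol(A)))) < tol
}

set.seed(1)
Y <- matrix(rnorm(2 * 50), nrow = 2)            # simulated data: 2 variables x 50 cases

MA <- matrix(c(1/2, 1, 1/2, -1), 2, 2)          # the MA transformation
R  <- 1/sqrt(2) * matrix(c(1, 1, 1, -1), 2, 2)  # the orthogonal matrix A from the text

d_native <- as.numeric(dist(t(Y)))
gap_MA <- max(abs(d_native - as.numeric(dist(t(MA %*% Y)))))
gap_R  <- max(abs(d_native - as.numeric(dist(t(R %*% Y)))))

c(MA = is_orthogonal(MA), R = is_orthogonal(R))  # orthogonality check
c(MA = gap_MA, R = gap_R)                        # largest change in any pairwise distance
```

Only the orthogonal matrix should leave every pairwise distance unchanged (up to floating-point rounding), which is the numerical counterpart of the straight line seen in the second scatterplot.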

An alternative method is to simulate from the joint distribution of $Z = (Z_1, Z_2)^T$. As we have mentioned above:

$$Z = AY + \eta \sim BVN(\eta + A\mu, A \Sigma A^T),$$

where $\eta = (0, 0)^T$ and $\Sigma = \begin{pmatrix} 1 & 0.95 \\ 0.95 & 1 \end{pmatrix}$.


Fig. 6.5 The matrix A transformation above is distance preserving (i.e., an isometry), as illustrated by the perfect linear relation between the native-space and the transformed pairs of twin height distances

Fig. 6.6 QQ-plot of the distances between twin heights (d) and distances between the simulated bivariate Normal distribution data (d3)

We can compute $A \Sigma A^T$ by hand or by using matrix multiplication in R:

sig <- matrix(c(1, 0.95, 0.95, 1), nrow=2)
A %*% sig %*% t(A)

##      [,1] [,2]
## [1,] 1.95 0.00
## [2,] 0.00 0.05

$A \Sigma A^T$ represents the transformed variance-covariance matrix, with $cov(z_1, z_2) = 0$. We can simulate $z_1$, $z_2$ independently from $z_1 \sim N(0, 1.95)$ and $z_2 \sim N(0, 0.05)$ (Note: independence and uncorrelation are equivalent for bivariate normal distribution) (Fig. 6.6).
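A simulation can make the covariance rule concrete. The sketch below is illustrative only: it draws bivariate normal columns with covariance $\Sigma$ through a Cholesky factor (an assumption made here to stay within base R), applies $A$, and compares the empirical covariance of $Z$ with $A \Sigma A^T$. The sample size `n` and all variable names are arbitrary choices, not taken from the original analysis.

```r
set.seed(2017)
Sigma <- matrix(c(1, 0.95, 0.95, 1), nrow = 2)
A <- 1/sqrt(2) * matrix(c(1, 1, 1, -1), 2, 2)
n <- 10000                                   # illustrative sample size

L <- t(chol(Sigma))                          # lower factor, Sigma = L %*% t(L)
Y <- L %*% matrix(rnorm(2 * n), nrow = 2)    # columns are draws from BVN(0, Sigma)
Z <- A %*% Y                                 # transformed data, eta = (0, 0)

theoretical <- A %*% Sigma %*% t(A)          # covariance predicted by the formula
empirical <- cov(t(Z))                       # covariance estimated from the sample

round(theoretical, 3)
round(empirical, 3)
```

The two matrices should agree up to sampling error, with a near-zero off-diagonal entry.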


set.seed(2017)
zz1 = rnorm(1000, 0, sd = sqrt(1.95))
zz2 = rnorm(1000, 0, sd = sqrt(0.05))
zz = rbind(zz1, zz2)
d3 = dist(t(zz))
qqplot(d, d3)
abline(a = 0, b = 1, col = 2)

We can observe that the distances computed using the original data and the simulated data follow essentially the same distribution (Figs. 6.7 and 6.8).

thelim <- c(-3, 3)
#par(mfrow=c(2,1))
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)",
     ylab="Twin 2 (standardized height)",
     xlim=thelim, ylim=thelim)
points(y[1, 1:2], y[2, 1:2], col=2, pch=16)

plot(z[1, ], z[2, ], xlim=thelim, ylim=thelim, xlab="Average height",
     ylab="Difference in height")
points(z[1, 1:2], z[2, 1:2], col=2, pch=16)

Fig. 6.7 Twin height scatterplot before rotation

Fig. 6.8 Twin height scatterplot after the rotation

We  applied  this  transformation  and  observed  that  the  distances  between  points 
were  unchanged  after  the  rotation.  This  rotation  achieves  the  goals  of: 

•  Preserving  the  distances  between  points,  and 

•  Reducing the dimensionality of the data (see plot reducing 2D to 1D).

Removing  the  second  dimension  and  recomputing  the  distances,  we  get 
(Fig.  6.9): 

d4 = dist(z[1, ])  ## distance computed using just the first dimension
plot(as.numeric(d), as.numeric(d4))
abline(0, 1)

The 1D distances provide a very good approximation to the actual 2D distances. This first dimension of the transformed data is called the first principal component. In general, this idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimensionality reduction.
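The same idea can be sketched with prcomp() on simulated data. The block below is an illustration under stated assumptions: the data are simulated standardized pairs with correlation 0.95 (not the twin heights), stored with cases in rows, and the variable names are hypothetical. It rotates the data onto its principal axes and compares the full 2D distances with distances computed from the first principal component alone.

```r
set.seed(2017)
n <- 200
Sigma <- matrix(c(1, 0.95, 0.95, 1), nrow = 2)
# Rows are cases here (n x 2), the layout prcomp() expects
Y <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma)

pca <- prcomp(Y, center = TRUE)
scores <- pca$x                          # rotated coordinates of the centered data

d_full <- as.numeric(dist(Y))            # native 2D distances
d_rot  <- as.numeric(dist(scores))       # distances after rotation, both PCs kept
d_pc1  <- as.numeric(dist(scores[, 1]))  # distances using the first PC only

rotation_gap <- max(abs(d_full - d_rot))       # ~0: the rotation is an isometry
pc1_share <- pca$sdev[1]^2 / sum(pca$sdev^2)   # variance captured by the first PC
pc1_cor <- cor(d_full, d_pc1)                  # agreement of 1D and 2D distances
```

Keeping both components reproduces the distances exactly, while dropping the second one loses only a small share of the variance, which is why the 1D distances track the 2D distances so closely.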
Fig. 6.9 Comparing the twin distances, computed using just one dimension, following the rotation transformation against the actual twin pair height distances. The strong linear relation suggests that measuring distances in the native space is equivalent to measuring distances in the transformed space, where we reduced the dimension of the data from 2D to 1D


6.3  Notation 

In the notation above, the rows represent variables and columns represent cases. In general, rows represent cases and columns represent variables. Hence, in our example shown here, $Y$ would be transposed to be an $N \times 2$ matrix. This is the most common way to represent the data: individuals in the rows, features in the columns. In genomics, it is more common to represent subjects/SNPs/genes in the columns. For example, genes are rows and samples are columns. The sample covariance matrix is usually denoted by $X^T X$ and has cells representing covariance between two units. Yet, for this to be the case, we need the rows of $X$ to represent the subjects and the columns to represent the variables, or features. Here, we have to compute $Y Y^T$ instead, following the rescaling.


6.4  Summary  (PCA  vs.  ICA  vs.  FA)

Principal Component Analysis (PCA), Independent Component Analysis (ICA), and
Factor  Analysis  (FA)  are  similar  strategies,  seeking  to  identify  a  new  basis  (vectors 
representing  the  principal  directions)  that  the  data  is  projected  against  to  maximize 
certain  (specific  to  each  technique)  objective  functions.  These  basis  functions,  or 
vectors,  are  just  linear  combinations  of  the  original  features  in  the  data/signal. 

The singular value decomposition (SVD), discussed later in this chapter, provides a specific matrix factorization algorithm that can be employed in various techniques to decompose a data matrix $X_{m \times n}$ as $U \Sigma V^T$, where $U$ is an $m \times m$ real or complex unitary matrix ($U U^T = U^T U = I$), $\Sigma$ is an $m \times n$ rectangular diagonal matrix of singular values, representing non-negative values on the diagonal, and $V$ is an $n \times n$ unitary matrix (Table 6.2).

Table 6.2 Summary of some dimensionality reduction methods

| Method | Assumptions | Cost function optimization | Applications |
|--------|-------------|----------------------------|--------------|
| PCA | Gaussian signals | Aims to explain the variance in the original signal. Minimizes the covariance of the data and yields high-energy orthogonal vectors in terms of the signal variance. PCA looks for an orthogonal linear transformation that maximizes the variance of the variables | Relies on first and second moments of the measured data, which makes it useful when data features are close to Gaussian |
| ICA | No Gaussian signal assumptions | Minimizes higher-order statistics (e.g., third and fourth order skewness and kurtosis), effectively minimizing the mutual information of the transformed output. ICA seeks a linear transformation where the basis vectors are statistically independent, but neither Gaussian, orthogonal, nor ranked in order | Applicable for non-Gaussian, very noisy, or mixture processes composed of simultaneous input from multiple sources |
| FA | Approximately Gaussian data | Objective function relies on second order moments to compute likelihoods. FA factors are linear combinations that maximize the shared portion of the variance underlying latent variables, which may use a variety of optimization strategies (e.g., maximum likelihood) | PCA-generalization used to test a theoretical model of latent factors causing the observed features |


6.5  Principal  Component  Analysis  (PCA) 

PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.


6.5.1  Principal  Components 

Let’s consider the simplest situation where we have $n$ observations $\{p_1, p_2, \ldots, p_n\}$ with two features $p_i = (x_i, y_i)$. When we draw them on a plot, we use the x-axis and y-axis for positioning. However, we can make our own coordinate system by principal components (Fig. 6.10).

Fig. 6.10 Schematic representation of the first two principal components (simulated data)


ex <- data.frame(x=c(1, 3, 5, 6, 10, 16, 50), y=c(4, 6, 5, 7, 10, 13, 12))
reg1 <- lm(y~x, data=ex)
plot(ex)
abline(reg1, col='red', lwd=4)
text(40, 10.5, "pc1")
segments(10.5, 11, 15, 7, lwd=4)
text(11, 7, "pc2")

As illustrated on the graph, the first PC, $pc_1$, is a minimum distance fit in the feature space. The second PC is a minimum distance fit to a line perpendicular to the first PC. Similarly, the third PC would be a minimum distance fit to a line perpendicular to all previous PCs. In our case of a 2D space, two PCs are the most we can have. In higher dimensional spaces, we have to figure out how many PCs are needed to achieve the best performance.

In general, the formula for the first PC is $pc_1 = a_1^T X = \sum_{i=1}^{N} a_{i,1} X_i$, where $X_i$ is an $n \times 1$ vector representing a column of the matrix $X$ (complete design matrix with a total of $n$ observations and $N$ features). The weights $a_1 = \{a_{1,1}, a_{2,1}, \ldots, a_{N,1}\}$ are chosen to maximize the variance of $pc_1$. According to this rule, the $k$th PC, $pc_k = a_k^T X = \sum_{i=1}^{N} a_{i,k} X_i$, where $a_k = \{a_{1,k}, a_{2,k}, \ldots, a_{N,k}\}$, has to be constrained by more conditions:

1. Variance of $pc_k$ is maximized

2. $Cov(pc_k, pc_l) = 0$, $\forall\ 1 \le l < k$

3. $a_k^T a_k = 1$ (the weights vectors are unitary).

Let’s figure out how to find $a_1$. To begin, we need to express the variance of our first principal component using the variance covariance matrix of $X$:

$$Var(pc_1) = E(pc_1^2) - (E(pc_1))^2 = \sum_{i,j=1}^{N} a_{i,1} a_{j,1} E(x_i x_j) - \sum_{i,j=1}^{N} a_{i,1} a_{j,1} E(x_i) E(x_j) = \sum_{i,j=1}^{N} a_{i,1} a_{j,1} S_{i,j},$$

where $S_{i,j} = E(x_i x_j) - E(x_i) E(x_j)$.

This implies $Var(pc_1) = a_1^T S a_1$, where $S = (S_{i,j})$ is the covariance matrix of $X = \{X_1, \ldots, X_N\}$. Since $a_1$ maximizes $Var(pc_1)$ and the constraint $a_1^T a_1 = 1$ holds, we can rewrite $a_1$ as:

$$a_1 = \arg\max_{a_1} \left( a_1^T S a_1 - \lambda (a_1^T a_1 - 1) \right),$$

where the term after the subtraction is zero whenever the constraint holds. Take the derivative of this expression w.r.t. $a_1$ and set the derivative to zero, which yields $(S - \lambda I_N) a_1 = 0$.

In Chap. 5 we showed that $a_1$ will correspond to the largest eigenvalue of $S$, the variance covariance matrix of $X$ (that is, $a_1$ is the eigenvector paired with that eigenvalue). Hence, $pc_1$ retains the largest amount of variation in the sample. Likewise, $a_k$ is the eigenvector corresponding to the $k$th largest eigenvalue of $S$.
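For completeness, the step from the constrained maximization to the eigenvalue equation can be written out explicitly. This is the standard Lagrange-multiplier argument, stated here as a sketch that uses the same symbols as above and relies on $S$ being symmetric:

```latex
\begin{aligned}
\frac{\partial}{\partial a_1}\left( a_1^T S a_1 - \lambda \left(a_1^T a_1 - 1\right) \right)
  &= 2 S a_1 - 2 \lambda a_1 = 0
  \;\Longrightarrow\; S a_1 = \lambda a_1, \\
Var(pc_1) &= a_1^T S a_1 = \lambda \, a_1^T a_1 = \lambda .
\end{aligned}
```

So every stationary point is an eigenvector of $S$, the variance of the corresponding component equals its eigenvalue, and the maximum is attained at the eigenvector with the largest eigenvalue.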

PCA requires the data matrix to have zero empirical means for each column. That is, the sample mean of each column has been shifted to zero.
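The link between the eigen-decomposition of $S$ and the principal components can be checked numerically. The following sketch uses simulated stand-in data (not the PPMI subset), and all variable names are illustrative assumptions. It compares eigen(cov(X)) with prcomp(), and confirms the unit-norm and zero-covariance constraints listed above.

```r
set.seed(42)
n <- 200
base <- rnorm(n)
# Simulated stand-in data: two correlated features plus an unrelated one
X <- cbind(f1 = base + rnorm(n, sd = 0.3),
           f2 = 2 * base + rnorm(n, sd = 0.5),
           f3 = rnorm(n))

S <- cov(X)                      # cov() centers the columns internally
eig <- eigen(S)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Eigenvalues of S equal the variances of the PCs
eigenvalue_gap <- max(abs(eig$values - pca$sdev^2))
# Eigenvectors are defined only up to sign, so compare absolute values
loading_gap <- max(abs(abs(eig$vectors) - abs(unname(pca$rotation))))
# Constraint 3: unit-norm weight vectors; constraint 2: uncorrelated PCs
norm_gap <- max(abs(colSums(pca$rotation^2) - 1))
score_cov <- cov(pca$x)
offdiag_gap <- max(abs(score_cov[upper.tri(score_cov)]))
```

All four gaps should be zero up to floating-point rounding.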

Let’s  use  a  subset  (N  =  33)  of  Parkinson's  Progression  Markers  Initiative  (PPMI) 
database  to  demonstrate  the  relationship  between  S  and  PC  loadings.  First,  we  need 
to  import  the  dataset  into  R  and  delete  the  patient  ID  column. 


library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SMHS_PCA_ICA_FA")
html_nodes(wiki_url, "#content")

pd.sub <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(pd.sub)

##    Patient_ID   Top_of_SN_Voxel_Intensity_Ratio
##  Min.   :3001   Min.   :1.058
##  1st Qu.:3012   1st Qu.:1.334
##  Median :3029   Median :1.485
##  Mean   :3204   Mean   :1.532
##  3rd Qu.:3314   3rd Qu.:1.755
##  Max.   :3808   Max.   :2.149
##  Side_of_SN_Voxel_Intensity_Ratio    Part_IA         Part_IB
##  Min.   :0.9306                   Min.   :0.000   Min.   : 0.000
##  1st Qu.:0.9958                   1st Qu.:0.000   1st Qu.: 2.000
##  Median :1.1110                   Median :1.000   Median : 5.000
##  Mean   :1.1065                   Mean   :1.242   Mean   : 4.909
##  3rd Qu.:1.1978                   3rd Qu.:2.000   3rd Qu.: 7.000
##  Max.   :1.3811                   Max.   :6.000   Max.   :13.000
##     Part_II          Part_III
##  Min.   : 0.000   Min.   : 0.00
##  1st Qu.: 0.000   1st Qu.: 2.00
##  Median : 2.000   Median :12.00
##  Mean   : 4.091   Mean   :13.39
##  3rd Qu.: 6.000   3rd Qu.:20.00
##  Max.   :17.000   Max.   :36.00

pd.sub <- pd.sub[, -1]

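Then, we need to center pd.sub by subtracting the average of all column means from each element. (Subtracting a single constant does not zero each column mean, but it leaves the covariance matrix unchanged, since cov() centers every column itself.) Next, we change pd.sub to a matrix and get its variance covariance matrix, S. Now, we are able to calculate the eigenvalues and eigenvectors of S.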




mu <- apply(pd.sub, 2, mean)
mean(mu)

## [1] 4.379068

pd.center <- as.matrix(pd.sub) - mean(mu)
S <- cov(pd.center)
eigen(S)

## $values
## [1] 1.315073e+02 1.178340e+01 6.096920e+00 1.424351e+00 6.094592e-02
## [6] 8.035403e-03
##
## $vectors
##              [,1]          [,2]         [,3]        [,4]        [,5]
## [1,] -0.007460885 -0.0182022093  0.016893318  0.02071859  0.97198980
## [2,] -0.005800877  0.0006155246  0.004186177  0.01552971  0.23234862
## [3,]  0.080839361 -0.0600389904 -0.027351225  0.99421646 -0.02352324
## [4,]  0.229718933 -0.2817718053 -0.929463536 -0.06088782  0.01466136
## [5,]  0.282109618 -0.8926329596  0.344508308 -0.06772403 -0.01764367
## [6,]  0.927911126  0.3462292153  0.127908417 -0.05068855  0.01305167
##              [,6]
## [1,] -0.232667561
## [2,]  0.972482080
## [3,] -0.009618592
## [4,]  0.003019008
## [5,]  0.006061772
## [6,]  0.002456374


The next step is to calculate the PCs using the prcomp() function in R. Note that we will use the uncentered version of the data and use the center=T option. We stored the model information into pca1. Then pca1$rotation provides the loadings for each PC.


pca1 <- prcomp(as.matrix(pd.sub), center = T)
summary(pca1)

## Importance of components:
##                            PC1    PC2     PC3     PC4    PC5     PC6
## Standard deviation     11.4677 3.4327 2.46919 1.19346 0.2469 0.08964
## Proportion of Variance  0.8716 0.0781 0.04041 0.00944 0.0004 0.00005
## Cumulative Proportion   0.8716 0.9497 0.99010 0.99954 1.0000 1.00000

pca1$rotation

##                                           PC1           PC2          PC3
## Top_of_SN_Voxel_Intensity_Ratio   0.007460885 -0.0182022093  0.016893318
## Side_of_SN_Voxel_Intensity_Ratio  0.005800877  0.0006155246  0.004186177
## Part_IA                          -0.080839361 -0.0600389904 -0.027351225
## Part_IB                          -0.229718933 -0.2817718053 -0.929463536
## Part_II                          -0.282109618 -0.8926329596  0.344508308
## Part_III                         -0.927911126  0.3462292153  0.127908417
##                                          PC4         PC5          PC6
## Top_of_SN_Voxel_Intensity_Ratio   0.02071859 -0.97198980 -0.232667561
## Side_of_SN_Voxel_Intensity_Ratio  0.01552971 -0.23234862  0.972482080
## Part_IA                           0.99421646  0.02352324 -0.009618592
## Part_IB                          -0.06088782 -0.01466136  0.003019008
## Part_II                          -0.06772403  0.01764367  0.006061772
## Part_III                         -0.05068855 -0.01305167  0.002456374



The loadings match the eigenvectors up to sign: here some columns (the first and the fifth) are the eigenvectors times -1. A sign flip still represents the same line in the 6D space (we have six columns for the original data); the multiplier -1 only points in the opposite direction along that line. For further comparisons, we can load the factoextra package to get the eigenvalues of PCs.


# install.packages("factoextra")
library("factoextra")
eigen <- get_eigenvalue(pca1); eigen

##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.315073e+02     87.159638589                    87.15964
## Dim.2 1.178340e+01      7.809737384                    94.96938
## Dim.3 6.096920e+00      4.040881920                    99.01026
## Dim.4 1.424351e+00      0.944023059                    99.95428
## Dim.5 6.094592e-02      0.040393390                    99.99467
## Dim.6 8.035403e-03      0.005325659                   100.00000



The eigenvalues correspond to the amount of the variation explained by each principal component (PC); they coincide with the eigenvalues of the S matrix computed above.
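As a small worked check, the percentages reported by get_eigenvalue() can be recomputed from the eigenvalues alone. The sketch below hard-codes the six eigenvalues printed above; the variable names and the helper `n_pcs_for` are illustrative assumptions.

```r
# Eigenvalues of S, copied by hand from the output above
ev <- c(1.315073e+02, 1.178340e+01, 6.096920e+00,
        1.424351e+00, 6.094592e-02, 8.035403e-03)

variance_percent <- 100 * ev / sum(ev)          # share of variance per PC
cumulative_percent <- cumsum(variance_percent)  # running total

# Hypothetical helper: smallest number of PCs reaching a cumulative threshold (in %)
n_pcs_for <- function(threshold) which(cumulative_percent >= threshold)[1]

round(variance_percent, 3)
round(cumulative_percent, 3)
```

For instance, `n_pcs_for(90)` gives the number of leading PCs needed to explain at least 90% of the total variance.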

To see detailed information about the variances that each PC explains, we utilize the plot() function. We can also visualize the PC loadings (Figs. 6.11, 6.12, and 6.13).


Fig. 6.11 Scree plot of the pca1

Fig. 6.12 A biplot, enhanced scatterplot, showing both points and vectors representing structure of the data in terms of the projections of the features onto the main two principal component directions

Fig. 6.13 A more elaborate biplot of the same Parkinson’s disease dataset (plot title: PCA - Biplot; x-axis: Dim1 (87.2%); legend groups: 0, 1, 2, 3, 4, 6)

plot(pca1)

library(graphics)
biplot(pca1, choices = 1:2, scale = 1, pc.biplot = F)

library("factoextra")
# Data for the supplementary qualitative variables
qualit_vars <- as.factor(pd.sub$Part_IA)
head(qualit_vars)

## [1] 0 3 1 0 1 1
## Levels: 0 1 2 3 4 6

# for plots of individuals
# fviz_pca_ind(pca1, habillage = qualit_vars, addEllipses = TRUE,
#   ellipse.level = 0.68) +
#   theme_minimal()

# for Biplot of individuals and variables
fviz_pca_biplot(pca1, axes = c(1, 2), geom = c("point", "text"),
    col.ind = "black", col.var = "steelblue", label = "all",
    invisible = "none", repel = T, habillage = qualit_vars,
    palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")

The scree plot (bar plot) has a clear “elbow” point at the second PC. Two PCs explain about 95% of the total variation. Thus, we can use the first 2 PCs to represent the data. In this case, the dimension of the data is substantially reduced.

Here, biplot uses PC1 and PC2 as the axes and red vectors to represent the direction of variables after adding loadings as weights. It helps us to visualize how the loadings are used to rearrange the structure of the data.

Next,  let’s  try  to  obtain  a  bootstrap  test  for  the  confidence  interval  of  the 
explained  variance  (Fig.  6.14). 

set.seed(12)
num_boot = 1000
bootstrap_it = function(i) {
  data_resample = pd.sub[sample(1:nrow(pd.sub), nrow(pd.sub), replace=TRUE), ]
  p_resample = princomp(data_resample, cor = T)
  return(sum(p_resample$sdev[1:3]^2)/sum(p_resample$sdev^2))
}

pco = data.frame(per=sapply(1:num_boot, bootstrap_it))
quantile(pco$per, probs = c(0.025, 0.975))  # specify 95-th % Confidence Interval

##      2.5%     97.5%
## 0.8134438 0.9035291

corpp = sum(pca1$sdev[1:3]^2)/sum(pca1$sdev^2)
require(ggplot2)

plot = ggplot(pco, aes(x=pco$per)) +
  geom_histogram() + geom_vline(xintercept=corpp, color='yellow') +
  labs(title = "Percent Var Explained by the first 3 PCs") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x='perc of var')
show(plot)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fig. 6.14 A histogram plot illustrating the proportion of the energy of the original dataset accounted for by the first three principal components (plot title: Percent Var Explained by the first 3 PCs; x-axis: perc of var)


6.6  Independent  Component  Analysis  (ICA) 

ICA  aims  to  find  basis  vectors  representing  independent  components  of  the  original 
data.  For  example,  this  may  be  achieved  by  maximizing  the  norm  of  the  fourth  order 
normalized  kurtosis,  which  iteratively  projects  the  signal  on  a  new  basis  vector, 
computes  the  objective  function  (e.g.,  the  norm  of  the  kurtosis)  of  the  result,  slightly 
adjusts  the  basis  vector  (e.g.,  by  gradient  ascent),  and  recomputes  the  kurtosis  again. 
The  end  of  this  iterative  process  generates  a  basis  vector  corresponding  to  the  highest 
(residual)  kurtosis  representing  the  next  independent  component. 

The process of Independent Component Analysis is to maximize the statistical independence of the estimated components. Assume that each variable $X_i$ is generated by a sum of $n$ independent components:

$$X_i = a_{i,1} s_1 + \cdots + a_{i,n} s_n.$$

Here, $X_i$ is generated by $s_1, \ldots, s_n$, and $a_{i,1}, \ldots, a_{i,n}$ are the corresponding weights. Finally, we rewrite $X$ as

$$X = As,$$

where $X = (X_1, \ldots, X_n)^T$, $A = (a_1, \ldots, a_n)^T$, $a_i = (a_{i,1}, \ldots, a_{i,n})$ and $s = (s_1, \ldots, s_n)^T$. Note that $s$ is obtained by maximizing the independence of the components. This procedure is done by maximizing some independence objective function.

ICA assumes all of its components ($s_i$) are non-Gaussian and independent of each other.
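To see why non-Gaussianity is a usable objective, recall that mixing independent sources tends to push the result toward a Gaussian shape. The base-R sketch below is an illustration of that effect, not the fastICA algorithm: `excess_kurtosis` is a hypothetical helper, and the uniform sources and the mixing matrix are simulated assumptions (cases are stored in rows, so the mixtures are computed as `S %*% A`).

```r
# Hypothetical helper: sample excess kurtosis (0 for a Gaussian variable)
excess_kurtosis <- function(x) {
  x <- x - mean(x)
  mean(x^4) / mean(x^2)^2 - 3
}

set.seed(1)
S <- matrix(runif(10000), 5000, 2)                 # two independent uniform sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)    # assumed mixing matrix
X <- S %*% A                                       # observed, correlated mixtures

k_sources <- apply(S, 2, excess_kurtosis)          # close to -1.2 for uniform data
k_mixed   <- apply(X, 2, excess_kurtosis)          # closer to 0 after mixing
mix_cor   <- cor(X)[1, 2]                          # mixtures are correlated

rbind(sources = k_sources, mixtures = k_mixed)
```

The mixtures have excess kurtosis closer to zero than the sources, so searching for projections with the most extreme kurtosis is one way to move back toward the independent components.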

We will now introduce the fastICA() function in R.

fastICA(X, n.comp, alg.typ, fun, row.norm, maxit, tol)

•  X: data matrix

•  n.comp: number of components

•  alg.typ: components extracted simultaneously (alg.typ == "parallel") or one at a time (alg.typ == "deflation")

•  fun: functional form of F to approximate to neg-entropy

•  row.norm: whether rows of the data matrix X should be standardized beforehand

•  maxit: maximum number of iterations

•  tol: a positive scalar giving the tolerance at which the un-mixing matrix is considered to have converged.

Let’s  generate  a  correlated  X  matrix. 


S <- matrix(runif(10000), 5000, 2)
S[1:10, ]

##             [,1]       [,2]
##  [1,] 0.19032887 0.92326457
##  [2,] 0.64582044 0.36716717
##  [3,] 0.09673674 0.51115358
##  [4,] 0.24813471 0.03997883
##  [5,] 0.51746238 0.03503276
##  [6,] 0.94568595 0.86846372
##  [7,] 0.29500222 0.76227787
##  [8,] 0.93488888 0.97061365
##  [9,] 0.89622932 0.62092241
## [10,] 0.33758057 0.84543862

A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A  # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively!
cor(X)

##            [,1]       [,2]
## [1,]  1.0000000 -0.4563297
## [2,] -0.4563297  1.0000000

The correlation between two variables is -0.4563297. Then we can start to fit the ICA model.

# install.packages("fastICA")
library(fastICA)
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "C", row.norm = FALSE, maxit = 200,
             tol = 0.0001)

To  visualize  how  correlated  the  pre-processed  data  is  and  how  independent  the 
resulting  S  is,  we  can  draw  the  following  two  plots  (Fig.  6.15). 

par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

Finally, we can check the correlation of two components in the ICA result, S; it is nearly 0.

cor(a$S)

##               [,1]          [,2]
## [1,]  1.000000e+00 -7.677818e-16
## [2,] -7.677818e-16  1.000000e+00


Fig. 6.15 Scatterplots of the raw data (left panel, "Pre-processed data") illustrating intrinsic relation in the simulated bivariate data and the ICA-transformed data (right panel, "ICA components") showing random scattering

To do a more interesting example, we can use the pd.sub dataset (Parkinson’s disease). It has six variables and the correlation is relatively high. After fitting the ICA model, the components are nearly independent.


cor(pd.sub)

##                                  Top_of_SN_Voxel_Intensity_Ratio
## Top_of_SN_Voxel_Intensity_Ratio                       1.00000000
## Side_of_SN_Voxel_Intensity_Ratio                      0.54747225
## Part_IA                                              -0.10144191
## Part_IB                                              -0.26966299
## Part_II                                              -0.04358545
## Part_III                                             -0.33921790
##                                  Side_of_SN_Voxel_Intensity_Ratio
## Top_of_SN_Voxel_Intensity_Ratio                         0.5474722
## Side_of_SN_Voxel_Intensity_Ratio                        1.0000000
## Part_IA                                                -0.2157587
## Part_IB                                                -0.4438992
## Part_II                                                -0.3766388
## Part_III                                               -0.5226128
##                                     Part_IA    Part_IB     Part_II
## Top_of_SN_Voxel_Intensity_Ratio  -0.1014419 -0.2696630 -0.04358545
## Side_of_SN_Voxel_Intensity_Ratio -0.2157587 -0.4438992 -0.37663875
## Part_IA                           1.0000000  0.4913169  0.50378157
## Part_IB                           0.4913169  1.0000000  0.57987562
## Part_II                           0.5037816  0.5798756  1.00000000
## Part_III                          0.5845831  0.6735584  0.63901337
##                                    Part_III
## Top_of_SN_Voxel_Intensity_Ratio  -0.3392179
## Side_of_SN_Voxel_Intensity_Ratio -0.5226128
## Part_IA                           0.5845831
## Part_IB                           0.6735584
## Part_II                           0.6390134
## Part_III                          1.0000000

a1 <- fastICA(pd.sub, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
              method = "C", row.norm = FALSE, maxit = 200,
              tol = 0.0001)
par(mfrow = c(1, 2))
cor(a1$X)

##             [,1]       [,2]       [,3]       [,4]        [,5]       [,6]
## [1,]  1.00000000  0.5474722 -0.1014419 -0.2696630 -0.04358545 -0.3392179
## [2,]  0.54747225  1.0000000 -0.2157587 -0.4438992 -0.37663875 -0.5226128
## [3,] -0.10144191 -0.2157587  1.0000000  0.4913169  0.50378157  0.5845831
## [4,] -0.26966299 -0.4438992  0.4913169  1.0000000  0.57987562  0.6735584
## [5,] -0.04358545 -0.3766388  0.5037816  0.5798756  1.00000000  0.6390134
## [6,] -0.33921790 -0.5226128  0.5845831  0.6735584  0.63901337  1.0000000

cor(a1$S)

##              [,1]         [,2]
## [1,] 1.000000e+00 1.088497e-15
## [2,] 1.088497e-15 1.000000e+00



Notice that we only have two ICA components instead of six variables, successfully reducing the dimension of the data.


6.7  Factor  Analysis  (FA)

Similar  to  ICA  and  PCA,  FA  tries  to  find  components  in  the  data.  As  a  generalization 
of  PCA,  FA  requires  that  the  number  of  components  is  smaller  than  the  original 
number  of  variables  (or  columns  of  the  data  matrix).  FA  optimization  relies  on 
iterative  perturbations  with  full-dimensional  Gaussian  noise  and  maximum- 
likelihood  estimation  where  every  observation  in  the  data  represents  a  sample 
point  in  a  subspace.  Whereas  PCA  assumes  the  noise  is  spherical,  Factor  Analysis 
allows  the  noise  to  have  an  arbitrary  diagonal  covariance  matrix  and  estimates  the 
subspace  as  well  as  the  noise  covariance  matrix. 

Under FA, the centered data can be expressed in the following form:

$$x_i - \mu_i = l_{i,1} F_1 + \cdots + l_{i,k} F_k + \epsilon_i = LF + \epsilon_i,$$

where $i \in 1, \ldots, p$, $j \in 1, \ldots, k$, $k < p$ and $\epsilon_i$ are independently distributed error terms with zero mean and finite variance.

Let’s do FA in R with function factanal(). According to PCA, our pd.sub dataset can explain 95% of variance with the first two principal components. This suggests that we might need two factors in FA. We can double check that by the following commands (Fig. 6.16).
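The report in this section is an nScree summary, but the exact call that produced it is not shown here. As a hedged base-R sketch of two of the four retention rules (the Kaiser rule and a simple parallel analysis), the block below works on simulated stand-in data with two underlying factors, not on pd.sub. The loadings, sample size, number of random replicates, and the 95% quantile are all illustrative assumptions.

```r
set.seed(123)
n <- 300; p <- 6
# Assumed two-factor structure used only to simulate stand-in data
L <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
           c(0, 0.8), c(0, 0.7), c(0, 0.6))
Fac <- matrix(rnorm(n * 2), n, 2)
E <- sweep(matrix(rnorm(n * p), n, p), 2, sqrt(1 - rowSums(L^2)), "*")
X <- Fac %*% t(L) + E

ev <- eigen(cor(X))$values            # eigenvalues of the correlation matrix

# Kaiser rule: keep components with eigenvalue above 1
n_kaiser <- sum(ev > 1)

# Parallel analysis: keep leading eigenvalues that exceed those of pure noise
rand_ev <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
threshold <- apply(rand_ev, 1, quantile, probs = 0.95)
n_parallel <- which(ev <= threshold)[1] - 1

c(kaiser = n_kaiser, parallel = n_parallel)
```

On data generated with two factors, both rules should retain two components, mirroring the kind of agreement summarized in the report.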


Fig. 6.16 Scree plots of various solutions (plot title: Non Graphical Solutions to Scree Test)

## Report For a nScree Class
##
## Details: components
##
##   Eigenvalues Prop Cumu Par.Analysis Pred.eig     OC Acc.factor     AF
## 1           3    1    1            1        1                NA (< AF)
## 2           1    0    1            1        1 (< OC)          1
## 3           1    0    1            1        0                 1
## 4           0    0    1            1        0                 0
## 5           0    0    1            1       NA                 0
## 6           0    0    1            0       NA                NA
##
##
## Number of factors retained by index
##
##   noc naf nparallel nkaiser
## 1   2   1         2       2

Three out of four rules in Cattell’s Scree test summary suggest we should use two factors. Thus, in function factanal() we use factors=2 and the varimax rotation, which rotates the solution to obtain a new set of factor loadings. Oblique promax and Procrustes rotation (projecting the loadings to a target matrix with a simple structure) are two other commonly used matrix rotations.

fit<-factanaL(pd. subj  factors=2j  rotation=" varimax" ) 

#  f it<-factanal(pd. sub,  factors=2,  rotation="promax" )  #  the  most  popular  obi 

ique  rotation;  And  fitting  a  simple  structure 

fit 

## Call:
## factanal(x = pd.sub, factors = 2, rotation = "varimax")
##
## Uniquenesses:
##  Top_of_SN_Voxel_Intensity_Ratio Side_of_SN_Voxel_Intensity_Ratio
##                            0.018                            0.534
##                          Part_IA                          Part_IB
##                            0.571                            0.410
##                          Part_II                         Part_III
##                            0.392                            0.218
##
## Loadings:
##                                  Factor1 Factor2
## Top_of_SN_Voxel_Intensity_Ratio           0.991
## Side_of_SN_Voxel_Intensity_Ratio -0.417   0.540
## Part_IA                           0.650
## Part_IB                           0.726  -0.251
## Part_II                           0.779
## Part_III                          0.825  -0.318
##
##                Factor1 Factor2
## SS loadings      2.412   1.445
## Proportion Var   0.402   0.241
## Cumulative Var   0.402   0.643
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 1.35 on 4 degrees of freedom.
## The p-value is 0.854
Fig. 6.17 Factor analysis results projecting the key features on the first two factor dimensions


Here the p-value 0.854 is very large, so we fail to reject the null hypothesis
that two factors are sufficient. We can also visualize the loadings for all
the variables (Fig. 6.17).


load <- fit$loadings
plot(load, type="n")  # set up plot
text(load, labels=colnames(pd.sub), cex=.7)  # add variable names

This plot displays factors 1 and 2 on the x-axis and y-axis, respectively.
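
The two rotations mentioned in this section can be compared side by side. The following is a minimal sketch on simulated data, not on the pd.sub dataset; the data frame sim, its six variables, and the two latent factors f1 and f2 are hypothetical and only serve to illustrate the calls.

```r
# Sketch: compare varimax (orthogonal) and promax (oblique) rotations
# on simulated data with a known two-factor structure (illustrative only).
set.seed(1234)
n  <- 300
f1 <- rnorm(n)                        # hypothetical latent factor 1
f2 <- rnorm(n)                        # hypothetical latent factor 2
sim <- data.frame(
  x1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  x2 = 0.8 * f1 + rnorm(n, sd = 0.5),
  x3 = 0.7 * f1 + rnorm(n, sd = 0.6),
  x4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  x5 = 0.8 * f2 + rnorm(n, sd = 0.5),
  x6 = 0.7 * f2 + rnorm(n, sd = 0.6)
)

fit_varimax <- factanal(sim, factors = 2, rotation = "varimax")
fit_promax  <- factanal(sim, factors = 2, rotation = "promax")

# Print both loading matrices, hiding entries below 0.1 in absolute value
print(fit_varimax$loadings, cutoff = 0.1)
print(fit_promax$loadings,  cutoff = 0.1)

# A rotation changes the loadings but not the fit itself:
# the uniquenesses and the sufficiency test are the same for both.
same_uniq <- all.equal(fit_varimax$uniquenesses, fit_promax$uniquenesses)
fit_varimax$PVAL                      # p-value of the "2 factors are sufficient" test
```

With six variables and two factors the test has 4 degrees of freedom, as in the output above. The rotation only redistributes the loadings to make them easier to interpret.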


6.8  Singular  Value  Decomposition  (SVD) 

SVD is a factorization of a real or complex matrix. If we have a data matrix $X$ with
$n$ observations and $p$ variables, it can be factorized into the following form:

$$X = U D V^T,$$

where $U$ is an $n \times p$ matrix with orthonormal columns, so that $U^T U = I$, $D$ is a
$p \times p$ diagonal matrix, and $V^T$ is the (conjugate) transpose of the $p \times p$
unitary matrix $V$. Thus, we have $V^T V = I$.

SVD is closely linked to PCA (when the correlation matrix is used for the calculation).
The columns of $U$ are the left singular vectors and the diagonal entries of $D$ are the
singular values. $U$ gives the PCA scores (up to the scaling by $D$), and the columns of
$V$, the right singular vectors, are the PCA loadings.


We can compare the output from the svd() function and the princomp()
function (another R function for PCA). Still, we are using the pd.sub dataset.
Before the SVD, we need to scale our data matrix.


# SVD output
df <- nrow(pd.sub) - 1
zvars <- scale(pd.sub)
z.svd <- svd(zvars)
z.svd$d/sqrt(df)


##  [1]  1.7878123  1.1053808  0.7550519  0.6475685  0.5688743  0.5184536 


z.svd$v 


##            [,1]       [,2]        [,3]        [,4]       [,5]        [,6]
## [1,]  0.2555204 0.71258155 -0.37323594  0.10487773 -0.4773992  0.22073161
## [2,]  0.3855208 0.47213743  0.35665523 -0.43312945  0.5581867  0.04564469
## [3,] -0.3825033 0.37288211  0.70992668  0.31993403 -0.2379855 -0.22728693
## [4,] -0.4597352 0.09803466 -0.11166513 -0.79389290 -0.2915570 -0.22647775
## [5,] -0.4251107 0.34167997 -0.46424927  0.26165346  0.5341197 -0.36505061
## [6,] -0.4976933 0.06258370  0.03872473 -0.01769966  0.1832789  0.84438182


# PCA output
pca2 <- princomp(pd.sub, cor=T)
pca2

## Call:
## princomp(x = pd.sub, cor = T)
##
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6
## 1.7878123 1.1053808 0.7550519 0.6475685 0.5688743 0.5184536
##
##  6  variables and  33 observations.


loadings(pca2)

##
## Loadings:
##                                  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Top_of_SN_Voxel_Intensity_Ratio  -0.256 -0.713 -0.373 -0.105  0.477 -0.221
## Side_of_SN_Voxel_Intensity_Ratio -0.386 -0.472  0.357  0.433 -0.558
## Part_IA                           0.383 -0.373  0.710 -0.320  0.238  0.227
## Part_IB                           0.460        -0.112  0.794  0.292  0.226
## Part_II                           0.425 -0.342 -0.464 -0.262 -0.534  0.365
## Part_III                          0.498                      -0.183 -0.844
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
## Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000

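When the correlation matrix is used for calculation (cor=T), the V matrix of
SVD contains the loadings of the PCA. The two outputs above agree up to the sign
of each column, and loadings() leaves entries with absolute value below 0.1 blank.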
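
The correspondence between svd() and princomp() can also be checked numerically. Below is a small sketch on simulated data (the matrix X and its column names are made up for illustration, since pd.sub is not reproduced here); it compares the scaled singular values with the PCA standard deviations and the right singular vectors with the PCA loadings.

```r
# Sketch: verify the SVD-PCA correspondence on simulated data (illustrative only).
set.seed(2023)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
X[, 2] <- X[, 1] + 0.5 * X[, 2]          # induce some correlation between columns
colnames(X) <- paste0("V", 1:4)

z_svd <- svd(scale(X))                   # SVD of the centered and standardized data
pca   <- princomp(X, cor = TRUE)         # PCA based on the correlation matrix

# Singular values divided by sqrt(n - 1) are the PCA standard deviations
sdev_from_svd <- z_svd$d / sqrt(nrow(X) - 1)
sdev_match <- all.equal(sdev_from_svd, unname(pca$sdev))

# Right singular vectors equal the PCA loadings up to the sign of each column
load_pca   <- unclass(loadings(pca))
load_match <- all.equal(abs(z_svd$v), abs(unname(load_pca)))

sdev_match
load_match
```

The comparison uses absolute values because the sign of each singular vector (and of each principal axis) is arbitrary, which is why some columns in the two outputs above have flipped signs.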
6.9 SVD Summary

Intuitively, the SVD approach $X = UDV^T$ represents a decomposition of the (centered!)
data into three geometrical transformations: a rotation or reflection ($U$), a scaling
($D$), and a rotation or reflection ($V$). Here we assume that the data $X$ stores samples/
cases in rows and variables/features in columns. If these are reversed, then the
interpretations of the $U$ and $V$ matrices reverse as well.

• The columns of $V$ represent the directions of the principal axes, the columns of
$UD$ are the principal components, and the singular values $d_i$ in $D$ are related to
the eigenvalues $\lambda_i$ of the data variance-covariance matrix ($\Sigma$) via
$\lambda_i = \frac{d_i^2}{n-1}$, where the eigenvalues capture the magnitude of the data
variance in the respective PCs.

• The standardized scores are given by columns of $\sqrt{n-1}\,U$ and the corresponding
loadings are given by columns of $\frac{1}{\sqrt{n-1}}VD$. However, these “loadings” are not the
principal directions. The requirement for $X$ to be centered is needed to ensure that
the covariance matrix is $Cov(X) = \frac{1}{n-1}X^TX$.

• Alternatively, to perform PCA on the correlation matrix (instead of the
covariance matrix), the columns of $X$ need to be scaled (centered and
standardized).

• To reduce the data dimensionality from $p$ to $k < p$, we multiply the first $k$ columns
of $U$ by the $k \times k$ upper-left corner of the matrix $D$ to get an $n \times k$ matrix $U_kD_k$
containing the first $k$ PCs.

• Multiplying the first $k$ PCs by their corresponding principal directions $V_k^T$
reconstructs the original data from the first $k$ PCs, $X_k = U_kD_kV_k^T$, with the lowest
possible reconstruction error.

• Typically, we have more subjects/cases ($n$) than variables/features ($p < n$). In the
full decomposition, $U$ is $n \times n$ and $V$ is $p \times p$, and the last $n - p > 0$ columns of $U$
are trivial because they are multiplied by zero rows of $D$. It’s customary to drop these
columns of $U$ for $n \gg p$ to avoid dealing with unnecessarily large (trivial) matrices.
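
The dimension-reduction and reconstruction steps summarized above can be sketched in a few lines of R. This is an illustration on a small simulated matrix (the sizes n, p and the choice k = 2 are arbitrary assumptions), not part of the Parkinson's disease analysis.

```r
# Sketch: rank-k reconstruction from the SVD of a centered data matrix.
set.seed(42)
n <- 20; p <- 5; k <- 2                       # arbitrary illustrative sizes
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)

s <- svd(X)                                   # thin SVD: s$u is n x p, s$v is p x p

PCs_k <- s$u[, 1:k] %*% diag(s$d[1:k])        # n x k matrix U_k D_k (first k PCs)
X_k   <- PCs_k %*% t(s$v[, 1:k])              # rank-k reconstruction U_k D_k V_k^T

# Eigenvalues of the covariance matrix from the singular values: d^2 / (n - 1)
eigenvalues <- s$d^2 / (n - 1)
eig_match   <- all.equal(eigenvalues, eigen(cov(X))$values)

# The Frobenius norm of the residual equals the root sum of squares
# of the discarded singular values
recon_error <- sqrt(sum((X - X_k)^2))
err_match   <- all.equal(recon_error, sqrt(sum(s$d[(k + 1):p]^2)))

eig_match
err_match
```

Note that svd() in R returns the thin factorization by default, so the trivial columns of U mentioned in the last bullet are already dropped.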


6.10  Case  Study  for  Dimension  Reduction 
(Parkinson’s  Disease) 

Step  1:  Collecting  Data 

The data we will be using in this case study is the Clinical, Genetic and Imaging
Data for Parkinson’s Disease on the SOCR website. A detailed data explanation
is available online at http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata.
Let’s import the data into R.
library(rvest)
## Loading required package: xml2

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top


pd_data <- html_table(html_nodes(wiki_url, "table")[[1]])
head(pd_data); summary(pd_data)


##   Cases L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
## 1     2                   597              767                   855
## 2     2                   597              767                   855
## 3     2                   597              767                   855
## 4     2                   597              767                   855
## 5     3                   604              873                   935
## 6     3                   604              873                   935
##   chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## 1                   0            1            12              1    0
## 2                   0            1            12              1    6
## 3                   0            1            12              1   12
## 4                   0            1            12              1   18
## 5                   1            0            19             22    0
## 6                   1            0            19             22    6
##      Cases       L_caudate_ComputeArea L_caudate_Volume
##  Min.   :  2.0   Min.   :525.0         Min.   :719.0
##  1st Qu.:158.0   1st Qu.:582.0         1st Qu.:784.0
##  Median :363.5   Median :600.0         Median :800.0
##  Mean   :346.1   Mean   :600.4         Mean   :800.3
##  3rd Qu.:504.0   3rd Qu.:619.0         3rd Qu.:819.0
##  Max.   :692.0   Max.   :667.0         Max.   :890.0
##       Time
##  Min.   : 0.0
##  1st Qu.: 4.5
##  Median : 9.0
##  Mean   : 9.0
##  3rd Qu.:13.5
##  Max.   :18.0


Step  2:  Exploring  and  Preparing  the  Data 

To  make  sure  that  the  data  is  ready  for  further  modeling,  we  need  to  fix  a  few  things. 
First,  the  Dx  variable,  or  diagnosis,  is  a  factor.  We  need  to  change  it  to  a  numeric 
variable.  Second,  we  don’t  need  the  patient  ID  and  time  variable  in  the  dimension 
reduction  procedures. 


pd_data$Dx <- gsub("PD", 1, pd_data$Dx)
pd_data$Dx <- gsub("HC", 0, pd_data$Dx)
pd_data$Dx <- gsub("SWEDD", 0, pd_data$Dx)
pd_data$Dx <- as.numeric(pd_data$Dx)
attach(pd_data)
pd_data <- pd_data[, -c(1, 33)]
Fig. 6.18 Barplot illustrating the decay of the eigenvalues corresponding to the PCA linear
transformation of the variables in the Parkinson’s disease dataset (Figs. 6.19 and 6.20)

Step  3:  Training  a  Model  on  the  Data 
1. PCA

Now we start the process of fitting a PCA model. Here we will use the
princomp() function and use the correlation rather than the covariance matrix
for calculation.

pca.model <- princomp(pd_data, cor=TRUE)

summary(pca.model)  # pc loadings (i.e., eigenvector columns)

##  Importance  of  components : 

##  Comp . 1  Comp . 2  Comp . 3  Comp . 4 

##  Standard  deviation  1.39495952  1.28668145  1.28111293  1.2061402 

##  Proportion  of  Variance  0.06277136  0.05340481  0.05294356  0.0469282 

##  Cumulative  Proportion  0.06277136  0.11617617  0.16911973  0.2160479 

##  Comp . 5  Comp . 6  Comp . 7  Comp . 8  Comp . 9 

##  Standard  deviation  1.18527282  1.15961464  1.135510  1.10882348  1.0761943 

##  Proportion  of  Variance  0.04531844  0.04337762  0.041593  0.03966095  0.037361 
##  Cumulative  Proportion  0.26136637  0.30474399  0.346337  0.38599794  0.423359 
##  Comp. 10  Comp. 11  Comp. 12  Comp. 13 

##  Standard  deviation  1.06687730  1.05784209  1.04026215  1.03067437 

##  Proportion  of  Variance  0.03671701  0.03609774  0.03490791  0.03426741 

##  Cumulative  Proportion  0.46007604  0.49617378  0.53108169  0.56534910 

##  Comp. 14  Comp. 15  Comp. 16  Comp. 17 

##  Standard  deviation  1.0259684  0.99422375  0.97385632  0.96688855 

##  Proportion  of  Variance  0.0339552  0.03188648  0.03059342  0.03015721 

##  Cumulative  Proportion  0.5993043  0.63119078  0.66178421  0.69194141 

##                           Comp.18    Comp.19    Comp.20    Comp.21
## Standard deviation     0.92687735 0.92376374 0.89853718 0.88924412
## Proportion of Variance 0.02771296 0.02752708 0.02604416 0.02550823
## Cumulative Proportion  0.71965437 0.74718145 0.77322561 0.79873384
##                           Comp.22    Comp.23    Comp.24    Comp.25
## Standard deviation     0.87005195 0.86433816 0.84794183 0.82232529
## Proportion of Variance 0.02441905 0.02409937 0.02319372 0.02181351
## Cumulative Proportion  0.82315289 0.84725226 0.87044598 0.89225949
##                           Comp.26    Comp.27    Comp.28    Comp.29
## Standard deviation     0.80703739 0.78546699 0.77505522 0.76624322
## Proportion of Variance 0.02100998 0.01990188 0.01937776 0.01893963
## Cumulative Proportion  0.91326947 0.93317135 0.95254911 0.97148875
##                           Comp.30    Comp.31
## Standard deviation     0.68806884 0.64063259
## Proportion of Variance 0.01527222 0.01323904
## Cumulative Proportion  0.98676096 1.00000000


plot(pca.model)

biplot(pca.model)

library(factoextra)  # provides fviz_pca_biplot()
fviz_pca_biplot(pca.model, axes = c(1, 2), geom = "point",
  col.ind = "black", col.var = "steelblue", label = "all",
  invisible = "none", repel = F, habillage = pd_data$Sex,
  palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")
Fig. 6.19 Biplot of the PD variables onto the first two principal axes

Fig. 6.20 Enhanced biplot of the PD data explicitly labeling the patients and control volunteers


We  can  see  that  in  real  world  examples  PCs  do  not  necessarily  have  an  “elbow”  in 
the  scree  plot  (Fig.  6.18).  In  our  model,  each  PC  explains  about  the  same  amount  of 
variation.  Thus,  it  is  hard  to  tell  how  many  PCs,  or  factors,  we  need  to  pick.  This 
would  be  an  ad  hoc  decision. 
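
When the scree plot has no clear elbow, one common (still ad hoc) alternative is to keep the smallest number of PCs whose cumulative proportion of variance exceeds a chosen threshold. The sketch below uses simulated uncorrelated data to mimic a flat scree plot; the matrix X and the threshold 0.6 are illustrative assumptions, not values prescribed by the case study.

```r
# Sketch: choose the number of PCs by a cumulative-variance threshold.
set.seed(7)
X   <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)   # weakly correlated features
pca <- princomp(X, cor = TRUE)

var_explained <- unname(pca$sdev^2 / sum(pca$sdev^2))    # proportion of variance per PC
cum_var       <- cumsum(var_explained)                   # cumulative proportion

threshold <- 0.6                                         # illustrative cutoff
k <- which(cum_var >= threshold)[1]                      # smallest k reaching the cutoff

round(cum_var, 3)
k
```

The same three lines (proportion, cumulative sum, first index above the cutoff) can be applied to pca.model$sdev from the model fitted above.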

2.  FA 

Let’s  set  up  a  Cattell’s  Scree  test  to  determine  the  number  of  factors  first. 


library(nFactors)  # provides parallel() and nScree()
ev <- eigen(cor(pd_data))  # get eigenvalues
ap <- parallel(subject=nrow(pd_data), var=ncol(pd_data), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
summary(nS)


## Report For a nScree Class
##
## Details: components
##
##    Eigenvalues Prop Cumu Par.Analysis Pred.eig     OC Acc.factor     AF
## 1            2    0    0            1        2 (< OC)         NA (< AF)
## 2            2    0    0            1        2                 0
## 3            2    0    0            1        1                 0
## 4            1    0    0            1        1                 0
## 5            1    0    0            1        1                 0
## ...
## 30           0    0    1            1       NA                 0
## 31           0    0    1            1       NA                NA
##
##
## Number of factors retained by index
##
##   noc naf nparallel nkaiser
## 1   1   1        14      14

Although the Cattell’s Scree test suggests that we should use 14 factors, the real fit
shows that 14 is not enough. The previous PCA results show that about 15 PCs are needed
to reach a cumulative variance of 0.6, and 20 PCs explain about 0.77 of the variance.
After a few trials, we find that 19 factors can pass the chi-square test for a
sufficient number of factors at the 0.05 level.


fa.model <- factanal(pd_data, 19, rotation="varimax")
fa.model


##
## Call:
## factanal(x = pd_data, factors = 19, rotation = "varimax")
##
## Uniquenesses:
##        L_caudate_ComputeArea             L_caudate_Volume
##                        0.840                        0.005
##        R_caudate_ComputeArea             R_caudate_Volume
##                        0.868                        0.849
##        L_putamen_ComputeArea             L_putamen_Volume
##                        0.791                        0.702
##        R_putamen_ComputeArea             R_putamen_Volume
##                        0.615                        0.438
##    L_hippocampus_ComputeArea         L_hippocampus_Volume
##                        0.476                        0.777
##    R_hippocampus_ComputeArea         R_hippocampus_Volume
##                        0.798                        0.522
##       cerebellum_ComputeArea            cerebellum_Volume
##                        0.137                        0.504
##  L_lingual_gyrus_ComputeArea       L_lingual_gyrus_Volume
##                        0.780                        0.698
##  R_lingual_gyrus_ComputeArea       R_lingual_gyrus_Volume
##                        0.005                        0.005
## L_fusiform_gyrus_ComputeArea      L_fusiform_gyrus_Volume
##                        0.718                        0.559
## R_fusiform_gyrus_ComputeArea      R_fusiform_gyrus_Volume
##                        0.663                        0.261
##                          Sex                       Weight
##                        0.829                        0.005
##                          Age                           Dx
##                        0.005                        0.005
##          chr12_rs34637584_GT          chr17_rs11868035_GT
##                        0.638                        0.721
##                 UPDRS_part_I                UPDRS_part_II
##                        0.767                        0.826
##               UPDRS_part_III
##                        0.616
##
## Loadings:
##                              Factor1 Factor2 Factor3 Factor4 Factor5
## ...
## Dx                             0.965
## chr12_rs34637584_GT
## chr17_rs11868035_GT           -0.303
## ...
##                              Factor6 Factor7 Factor8 Factor9 Factor10
## ...


##                Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
## SS loadings      1.282   1.029   1.026   1.019   1.013   1.011   0.921
## Proportion Var   0.041   0.033   0.033   0.033   0.033   0.033   0.030
## Cumulative Var   0.041   0.075   0.108   0.140   0.173   0.206   0.235
##                Factor8 Factor9 Factor10 Factor11 Factor12 Factor13
## SS loadings      0.838   0.782    0.687    0.647    0.615    0.587
## Proportion Var   0.027   0.025    0.022    0.021    0.020    0.019
## Cumulative Var   0.263   0.288    0.310    0.331    0.351    0.370
##                Factor14 Factor15 Factor16 Factor17 Factor18 Factor19
## SS loadings       0.569    0.566    0.547    0.507    0.475    0.456
## Proportion Var    0.018    0.018    0.018    0.016    0.015    0.015
## Cumulative Var    0.388    0.406    0.424    0.440    0.455    0.470
##
## Test of the hypothesis that 19 factors are sufficient.
## The chi square statistic is 54.51 on 47 degrees of freedom.
## The p-value is 0.211
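
The "few trials" used to find a sufficient number of factors can be automated by fitting factanal() for an increasing number of factors and reading off the p-value of the sufficiency test. The following is a hedged sketch on simulated data with a known two-factor structure; the matrix sim, the true loadings, and the range 1:4 are assumptions for illustration (for the PD data one would loop over a range around 14 to 20 instead).

```r
# Sketch: search for the smallest number of factors passing the chi-square test.
set.seed(99)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)                      # hypothetical latent factors
loadings_true <- cbind(c(.9, .8, .7, .6, 0, 0, 0, 0),
                       c(0, 0, 0, 0, .9, .8, .7, .6))
sim <- cbind(f1, f2) %*% t(loadings_true) +
       matrix(rnorm(n * 8, sd = 0.5), n, 8)         # add unique noise
colnames(sim) <- paste0("x", 1:8)

# p-value of "k factors are sufficient" for k = 1..4; NA if the fit fails
p_values <- sapply(1:4, function(k) {
  fit <- tryCatch(factanal(sim, factors = k), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else unname(fit$PVAL)
})

# Smallest k whose test is not rejected at the 0.05 level (NA if none)
k_sufficient <- which(p_values > 0.05)[1]

round(p_values, 3)
k_sufficient
```

The tryCatch() wrapper is there because factanal() can stop with an error when the optimizer does not converge for a given number of factors; such cases are recorded as NA rather than aborting the search.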
This data matrix has relatively low correlation. Thus, it is not suitable for ICA.


cor(pd_data)[1:10, 1:10]


##                           L_caudate_ComputeArea L_caudate_Volume
## L_caudate_ComputeArea               1.000000000       0.05794916
## L_caudate_Volume                    0.057949162       1.00000000
## R_caudate_ComputeArea              -0.060576361       0.01076372
## R_caudate_Volume                    0.043994457       0.07245568
## L_putamen_ComputeArea               0.009640983      -0.06632813
## L_putamen_Volume                   -0.064299184      -0.11131525
## R_putamen_ComputeArea               0.040808105       0.04504867
## R_putamen_Volume                    0.058552841      -0.11830387
## L_hippocampus_ComputeArea          -0.037932760      -0.04443615
## L_hippocampus_Volume               -0.042033469      -0.04680825
##                           R_putamen_ComputeArea R_putamen_Volume
## L_caudate_ComputeArea                0.04080810      0.058552841
## L_caudate_Volume                     0.04504867     -0.118303868
## R_caudate_ComputeArea                0.07864348      0.007022844
## R_caudate_Volume                     0.05428747     -0.094336376
## L_putamen_ComputeArea                0.09049611      0.176353726
## L_putamen_Volume                     0.09093926     -0.057687648
## R_putamen_ComputeArea                1.00000000      0.052245264
## R_putamen_Volume                     0.05224526      1.000000000
## L_hippocampus_ComputeArea           -0.05508472      0.131800075
## L_hippocampus_Volume                -0.08866344     -0.001133570
##                           L_hippocampus_ComputeArea L_hippocampus_Volume
## L_caudate_ComputeArea                  -0.037932760          -0.04203347
## L_caudate_Volume                       -0.044436146          -0.04680825
## R_caudate_ComputeArea                   0.051359613           0.08578833
## R_caudate_Volume                        0.006123355          -0.07791361
## L_putamen_ComputeArea                   0.094604791          -0.06442537
## L_putamen_Volume                        0.025303302           0.04041557
## R_putamen_ComputeArea                  -0.055084723          -0.08866344
## R_putamen_Volume                        0.131800075          -0.00113357
## L_hippocampus_ComputeArea               1.000000000          -0.02633816
## L_hippocampus_Volume                   -0.026338163           1.00000000


6.11  Assignments:  6.  Dimensionality  Reduction 
6.11.1 Parkinson’s Disease Example

Apply  principal  component  analysis  (PCA),  singular  value  decomposition  (SVD), 
independent  component  analysis  (ICA),  and  factor  analysis  (FA)  to  reduce  the 
dimensionality  of  the  PD  data.  Interpret  the  results. 
6.11.2 Allometric Relations in Plants Example

Load  Data 

Load Allometric Relations in Plants data and perform a proper type conversion, e.g.,
convert “Province” and “Born”.


Dimensionality  Reduction 

•  Apply Principal Component Analysis protocol.

   -  Generate a data summary

   -  Apply prcomp

   -  Report the rotations (scores)

   -  Display scree plot

   -  Select the number of PCs and employ a bootstrap test

   -  Apply factoextra to draw a biplot grouped by Province/Sites

•  Perform SVD and ICA and compare the results of PCA.

   -  Use these three variables L, M, D to perform ICA and show pair plots before
      ICA and after ICA. (Hint: scatter3dplot() may be helpful, which you
      saw in Chap. 5.)

•  Perform factor analysis.

   -  Use require(nFactors) to determine the number of the factors and show
      a scree plot as stated in notes

   -  Use factanal() to apply FA and compare the rotation varimax and
      promax

   -  Report the loadings and consider an appropriate visualization method.

•  Interpret the findings in the context of the case-study.
In the next several Chapters, we will concentrate on various progressively advanced
machine learning, classification and clustering techniques. There are two categories
of learning techniques we will explore: supervised (human-guided) classification and
unsupervised (fully-automated) clustering. In general, supervised classification aims
to identify or predict predefined classes and label new objects as members of specific
classes, whereas unsupervised clustering attempts to group objects into sets, without
knowing a priori labels, and determine relationships between objects.

In the context of machine learning, classification refers to supervised learning and
clustering to unsupervised learning.

Unsupervised classification refers to methods where the outcomes (groupings
with common characteristics) are automatically derived based on intrinsic affinities
and associations in the data without prior human indication of clustering.
Unsupervised learning is purely based on input data (X) without corresponding
output labels. The goal is to model the underlying structure, affinities, or distribution
in the data in order to learn more about its intrinsic characteristics. It is called
unsupervised learning because there are no a priori correct answers and there is no
human guidance. Algorithms are left to their own devices to discover and present the
interesting structure in the data. Clustering (discovers the inherent groupings in the
data) and association (discovers association rules that describe the data) represent
the core unsupervised learning problems. The k-means clustering and the Apriori
association rule provide solutions to unsupervised learning problems.

Supervised classification methods utilize user-provided labels representative of
specific classes associated with concrete observations, cases, or units. These training
classes/outcomes are used as references for the classification. Many problems can be
addressed by decision-support systems utilizing combinations of supervised and
unsupervised classification processes. Supervised learning involves input variables
(X) and an outcome variable (Y) to learn mapping functions from the input to the
output: Y = f(X). The goal is to approximate the mapping function so that when it is
applied to new (validation) data (Z) it (accurately) predicts the (expected) outcome
variables (Y). It is called supervised learning because the learning process is
supervised by initial training labels guiding and correcting the learning until the
algorithm achieves an acceptable level of performance.

Table 7.1 Summary of supervised classification and unsupervised clustering techniques

| Inference | Outcome | Supervised | Unsupervised |
|---|---|---|---|
| Classification & prediction | Binary | Classification-rules, OneR, kNN, NaiveBayes, Decision-Tree, C5.0, AdaBoost, XGBoost, LDA/QDA, Logit/Poisson, SVM | Apriori, Association-rules, k-Means, NaiveBayes |
| Classification & prediction | Categorical | Regression modeling & forecasting | Apriori, Association-rules, k-Means, NaiveBayes |
| Regression modeling | Real quantitative | LDA/QDA, SVM, Decision-Tree, NeuralNet | (MLR) Regression modeling, Regression modeling tree, Apriori/Association-rules |

Regression  (output  variable  is  a  real  value)  and  classification  (output  variable  is  a 
category)  problems  represent  the  two  types  of  supervised  learning.  Examples  of 
supervised  machine  learning  algorithms  include  Linear  regression  and  Random 
forest.  Both  provide  solutions  for  regression  problems,  but  Random  forest  also 
provides  solutions  to  classification  problems. 

Just like categorization of exploratory data analytics (Chap. 4) is challenging, so
is systematic codification of machine learning techniques. Table 7.1 attempts to
provide a rough representation of common machine learning methods. However, it is
not really intended to be a gold-standard protocol for choosing the best analytical
method. Before you settle on a specific strategy for data analysis, you should always
review the data characteristics in light of the assumptions of each technique and
assess the potential to gain new knowledge or extract valid information from
applying a specific technique (Table 7.1).

Many  of  these  will  be  discussed  in  later  Chapters.  In  this  Chapter,  we  will  present 
step-by-step  the  k-nearest  neighbor  (kNN)  algorithm.  Specifically,  we  will  show 
(1)  data  retrieval  and  normalization;  (2)  splitting  the  data  into  training  and  testing 
sets;  (3)  fitting  models  on  the  training  data;  (4)  evaluating  model  performance  on 
testing  data;  (5)  improving  model  performance;  and  (6)  determining  optimal  values 
of  k. 

In  Chap.  14,  we  will  present  detailed  strategies,  and  evaluation  metrics,  to  assess 
the  performance  of  all  clustering  and  classification  methods. 


7.1  Motivation 

Classification tasks could be very difficult when the features and target classes are
numerous, complicated, or extremely difficult to understand. In those scenarios
where the items of similar class type tend to be homogeneous, nearest neighbor
classification methods are well-suited because assigning unlabeled examples to the most
similar labeled examples would be fairly easy.

Such classification methods can help us to understand the story behind the
complicated case-studies. This is because machine learning methods generally
have no distribution assumptions. However, this non-parametric manner makes the
methods rely heavily on large and representative training datasets.


7.2  The  kNN  Algorithm  Overview 

The  kNN  algorithm  involves  the  following  steps: 

1. Create a training dataset that has classified examples labeled by nominal variables
and different features in ordinal or numerical variables.

2. Create a test dataset containing unlabeled examples with similar features as the
training data.

3. Given a predetermined number k, match each test record with the k training records
that are “nearest” in similarity.

4. Assign the class that contains the majority of the k training records to the test
record.
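
The four steps above can be sketched directly in base R. The function knn_predict() below is a hypothetical helper written for illustration (it is not the implementation used later in the case study), and the tiny training and test sets are made up.

```r
# Sketch: a from-scratch kNN classifier following the four steps above.
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Step 3: Euclidean distance from this test record to every training record
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- order(d)[1:k]                 # indices of the k nearest records
    # Step 4: majority vote among the k nearest labels
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Step 1: labeled training data (two well-separated groups)
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_y <- c("A", "A", "A", "B", "B", "B")

# Step 2: unlabeled test records with the same features
test_x <- rbind(c(1.5, 1.5), c(6.5, 6.5))

pred <- knn_predict(train_x, train_y, test_x, k = 3)
pred   # "A" "B"
```

Ties in the vote are broken here by taking the first class in alphabetical order, which is a simplification; library implementations typically break ties at random.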

The Fig. 7.1 demonstration shows the dynamic classification of the mouse location
(x, y) coordinates that are used as new data. You can specify the number of points (n)
and the number of nearest neighbors (k). The app automatically computes the
neighborhood size and the corresponding label (color) for the mouse location and draws the
connecting edges to the nearest neighbors showing the dynamic classification process.


7.2.1  Distance  Function  and  Dummy  Coding 


How do we measure the similarity between records? We can measure the similarity as
the geometric distance between the two records. There are many distance functions
to choose from. Traditionally, we use Euclidean distance as our distance function.

Fig. 7.1 Live Demo: k-nearest neighbor classification webapp

If we use a line to link the two dots created by the test record and the training
record in n dimensional space, the length of the line is the Euclidean distance.
Suppose a, b both have n features with coordinates $(a_1, a_2, \ldots, a_n)$ and
$(b_1, b_2, \ldots, b_n)$. A simple Euclidean distance could be defined by:

$$dist(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \ldots + (a_n - b_n)^2}.$$

When we have nominal features, it requires a little trick to apply the Euclidean
distance formula. We could create dummy variables as indicators of the nominal
feature. The dummy variable equals one when we have the feature and zero
otherwise. We show two examples:

$$Gender = \begin{cases} 0 & X = \text{male} \\ 1 & X = \text{female} \end{cases}, \qquad
Cold = \begin{cases} 0 & Temp \ge 37F \\ 1 & Temp < 37F \end{cases}.$$

A single dummy variable can encode only a binary feature. If a nominal feature has
more than two categories, create one dummy variable for each category and then apply
the Euclidean distance.
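
As a small illustration of dummy coding combined with the Euclidean distance, consider the following sketch. The data frame records, its columns, and the threshold value 37 are invented for the example; model.matrix() is one standard base-R way to expand a multi-category feature into indicator columns.

```r
# Sketch: dummy-code nominal features, then compute Euclidean distances.
euclid <- function(a, b) sqrt(sum((a - b)^2))

records <- data.frame(                       # made-up example records
  gender = c("male", "female", "female"),
  temp   = c(36.5, 38.2, 36.9),
  age    = c(30, 35, 30)
)

# Binary dummy variables, following the two definitions above
records$gender_female <- as.numeric(records$gender == "female")
records$cold          <- as.numeric(records$temp < 37)

feat <- as.matrix(records[, c("gender_female", "cold", "age")])
d12  <- euclid(feat[1, ], feat[2, ])         # sqrt(1 + 1 + 25) = sqrt(27)
dist(feat)                                   # all pairwise Euclidean distances

# A nominal feature with more than two categories: one indicator per category
blood   <- data.frame(type = c("A", "B", "O", "A"))
dummies <- model.matrix(~ type - 1, data = blood)
dummies
```

In this toy example the age difference dominates the distance, which is one reason the features are usually rescaled before applying kNN.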


7.2.2  Ways  to  Determine  k 

The parameter k should be neither too large nor too small. If our k is too large, the test
record tends to be classified as the most popular class in the training records rather
than the most similar one. On the other hand, if k is too small, outliers or noisy
data, such as mislabeled training records, might lead to errors in predictions.

A common practice is to calculate the square root of the number of training
examples and use that number as k.

A more robust way would be to try several values of k and select the one with the best
classification performance.
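
Both strategies for choosing k can be sketched with the knn() function from the class package, which ships with standard R distributions. The example below uses the built-in iris data as a stand-in (an assumption for illustration; it is not the case-study data), and the candidate values of k are arbitrary.

```r
# Sketch: square-root rule versus trying several values of k.
library(class)                                   # provides knn()

set.seed(123)
idx     <- sample(nrow(iris), 100)               # 100 training, 50 test records
train_x <- scale(iris[idx, 1:4])                 # standardize the training features
test_x  <- scale(iris[-idx, 1:4],                # reuse the training means and SDs
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Rule of thumb: k = square root of the number of training examples
k_rule <- round(sqrt(nrow(train_x)))             # sqrt(100) = 10

# More robust: evaluate several candidate values and keep the best one
candidates <- c(1, 3, 5, 7, k_rule, 15)
acc <- sapply(candidates, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)                           # classification accuracy
})
names(acc) <- candidates
best_k <- candidates[which.max(acc)]

acc
best_k
```

For simplicity, the candidates are compared on the test set here. In practice a separate validation set or cross-validation is preferable, so that the test set remains untouched for the final performance estimate.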


7.2.3  Rescaling  of  the  Features 

Different features might have different scales. For example, we can have a measure
of pain ranging from one to ten or from one to one hundred. Such features could be
transformed onto the same scale. Re-scaling makes each feature contribute to the
distance in a relatively equal manner.
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7.2.4  Rescaling  Formulas 

1 .  min-max  normalization 


_  X-  min(X) 

Xfiew  /  T7-\  •  / 

max(X)  —  min(X) 

After  re-scaling,  Xnew  would  range  between  0  and  1.  It  measures  the  distance 
between  each  value  and  its  minimum  as  a  percentage.  The  larger  a  percentage  the 
further  a  value  is  from  the  minimum.  100%  means  that  the  value  is  at  the  maximum. 

2. z-score standardization

$$X_{new} = \frac{X - \mu}{\sigma} = \frac{X - Mean(X)}{SD(X)}$$

This is based on the properties of the normal distribution that we talked about in
Chap. 3. After z-score standardization, the re-scaled feature has an unbounded
range. This is different from min-max normalization, which has a limited range from
0 to 1. The standardized feature has mean 0 and standard deviation 1; it follows a
standard normal distribution only when the original X is (approximately) normally
distributed.
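
As a quick illustration of the two rescaling formulas, the following sketch applies both to a small made-up vector (the values in x are hypothetical) and checks the z-scores against the base R scale() function.

```r
x <- c(2, 4, 4, 6, 9)   # hypothetical pain scores

# min-max normalization: values land in [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# z-score standardization: mean 0, standard deviation 1
x_z <- (x - mean(x)) / sd(x)

# scale() computes the same z-scores (it returns a one-column matrix)
x_scale <- as.vector(scale(x))
```

Note that sd() uses the sample standard deviation (denominator n - 1), which is also what scale() uses.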


7.3  Case  Study 

7.3.1  Step  1:  Collecting  Data 

The  data  we  are  using  for  this  case  study  is  the  “Boys  Town  Study  of  Youth 
Development”,  which  is  the  second  case  study,  CaseStudy02_Boystown_Data.csv. 
Variables: 

•  ID:  Case  subject  identifier. 

•  Sex:  dichotomous  variable  (1  =  male,  2  =  female). 

•  GPA: Interval-level variable with range of 0-5 (0-"A" average, 1-"B" average,
2-"C" average, 3-"D" average, 4-"E", 5-"F").

•  Alcohol use: Interval-level variable from 0 to 11 (drinks every day - never drinks).

•  Attitudes  on  drinking  in  the  household:  Alcatt-  Interval  level  variable  from  0  to 
6  (totally  approve  -  totally  disapprove). 

•  Dadjob: 1-yes, dad has a job; 2-no.

•  Momjob: 1-yes and 2-no.

•  Parent  closeness  (example:  In  your  opinion,  does  your  mother  make  you  feel 
close  to  her?) 

-  Dadclose:  Interval  level  variable  0-7  (usually-never) 

-  Momclose:  interval  level  variable  0-7  (usually-never). 

•  Delinquency:

-  larceny  (how  many  times  have  you  taken  things  >$50?):  Interval  level  data 
0-4  (never  -  many  times), 

-  vandalism:  Interval  level  data  0-7  (never  -  many  times). 


7.3.2  Step  2:  Exploring  and  Preparing  the  Data 

First, we need to load the data and do some data manipulation. We are using the
Euclidean distance, so binary categorical variables should be coded as dummy (0/1)
variables. The following code recodes sex, dadjob, and momjob into dummy variables.


boystown<-read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex<-boystown$sex-1
boystown$dadjob<--1*(boystown$dadjob-2)
boystown$momjob<--1*(boystown$momjob-2)
str(boystown)

## 'data.frame':    200 obs. of  11 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : int  5 0 3 2 3 3 1 5 1 3 ...
##  $ Alcoholuse: int  2 4 2 2 6 3 2 6 5 2 ...
##  $ alcatt    : int  3 2 3 1 2 0 0 3 0 1 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : int  1 3 2 1 2 1 3 6 3 1 ...
##  $ momclose  : int  1 4 2 2 1 2 1 2 3 2 ...
##  $ larceny   : int  1 0 0 3 1 0 0 0 1 1 ...
##  $ vandalism : int  3 0 2 2 2 0 5 1 4 0 ...
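
To make the recoding arithmetic explicit, here is a small sketch on made-up vectors (the values are hypothetical, not taken from the Boys Town data). Subtracting 1 maps the codes 1/2 to 0/1, while -1*(x-2) maps 1 (yes) to 1 and 2 (no) to 0.

```r
sex    <- c(1, 2, 2, 1)      # 1 = male, 2 = female
dadjob <- c(1, 1, 2, 1)      # 1 = yes, 2 = no

sex_dummy    <- sex - 1            # 0 = male, 1 = female
dadjob_dummy <- -1 * (dadjob - 2)  # 1 = has a job, 0 = no job
```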

The str() function reports that we have 200 observations and 11 variables.
The ID variable is not informative for this case study, so we can delete it. The
variable of most interest is GPA, which we classify into two categories. Students
with a "C" average or higher (GPA codes 0-2) are placed in the "above average"
category; students with an average below "C" (GPA codes 3-5) are placed in the
"average or below" category. These two are the classes of interest for this case study.


boystown<-boystown[, -1]
table(boystown$gpa)

##
##  0  1  2  3  4  5
## 30 50 54 40 14 12

boystown$grade<-boystown$gpa %in% c(3, 4, 5)
boystown$grade<-factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
table(boystown$grade)

##
##    above_avg avg_or_below
##          134           66

Let's look at the proportions for the two categories.

round(prop.table(table(boystown$grade))*100, digits=1)

##    above_avg avg_or_below
##           67           33

We  can  see  that  most  of  the  students  are  above  average  (67%). 

The  remaining  ten  features  are  all  numeric  but  with  different  scales.  If  we  use 
these  features  directly,  the  ones  with  larger  scale  will  have  a  greater  impact  on  the 
classification  performance.  Therefore,  re-scaling  is  needed  in  this  scenario. 

summary(boystown[c("Alcoholuse", "larceny", "vandalism")])

##    Alcoholuse       larceny       vandalism
##  Min.   : 0.00   Min.   :0.00   Min.   :0.0
##  1st Qu.: 2.00   1st Qu.:0.00   1st Qu.:1.0
##  Median : 4.00   Median :1.00   Median :2.0
##  Mean   : 3.87   Mean   :0.92   Mean   :1.9
##  3rd Qu.: 5.00   3rd Qu.:1.00   3rd Qu.:3.0
##  Max.   :11.00   Max.   :4.00   Max.   :7.0


7.3.3  Normalizing  Data 

First  let’s  create  a  function  of  our  own  using  the  min-max  normalization  formula.  We 
can  check  the  function  using  some  trial  vectors. 

normalize<-function(x){
  # be careful, the denominator may be trivial!
  return((x-min(x))/(max(x)-min(x)))
}

# some test examples:
normalize(c(1, 2, 3, 4, 5))

## [1] 0.00 0.25 0.50 0.75 1.00

normalize(c(1, 3, 6, 7, 9))

## [1] 0.000 0.250 0.625 0.750 1.000
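
The warning about a trivial denominator refers to constant features, for which max(x) equals min(x) and the min-max formula divides by zero. One possible guard is sketched below; the name normalize_safe and the choice to map a constant feature to zeros are assumptions made for this illustration, not part of the case study.

```r
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) {
    # constant feature: it cannot separate records, so map it to 0
    return(rep(0, length(x)))
  }
  (x - min(x)) / rng
}

normalize_safe(c(1, 2, 3, 4, 5))   # same result as the original function
normalize_safe(c(7, 7, 7))         # all zeros instead of NaN
```

Dropping constant features before modeling is an equally reasonable alternative, since they contribute nothing to the Euclidean distance.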

After confirming that it works properly, we use the lapply() function to
apply the normalization to each element of a list. A data frame is a list of
equal-length column vectors, so each feature is a list element that the
normalization function can be applied to. The as.data.frame() function then
converts the list returned by lapply() back into a data frame.

boystown_n<-as.data.frame(lapply(boystown[-11], normalize))

Let's see one of the features that have been normalized.

summary(boystown_n$Alcoholuse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1818  0.3636  0.3518  0.4545  1.0000

This  looks  great!  Now  we  can  move  to  the  next  step. 

7.3.4 Data Preparation: Creating Training and Testing Datasets

We  have  200  observations  in  this  dataset.  The  more  data  we  use  to  train  the 
algorithm,  the  more  precise  the  prediction  would  be.  We  can  use  3/4  of  the  data 
for  training  and  the  remaining  1/4  for  testing. 


# Ideally, we want to randomly split the raw data into training and testing
# For example: 80% training + 20% testing
# subset_int <- sample(nrow(boystown_n), floor(nrow(boystown_n)*0.8))
# bt_train<-boystown_n[subset_int, ]; bt_test<-boystown_n[-subset_int, ]

# Below, we use a simpler 3:1 split for simplicity
bt_train<-boystown_n[1:150, -11]
bt_test<-boystown_n[151:200, -11]

The following step is to extract the labels or classes (column 11 of boystown, the
binary grade factor defined above) for our two subsets.

bt_train_labels<-boystown[1:150, 11]
bt_test_labels<-boystown[151:200, 11]
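
The random split mentioned in the comments can be sketched as follows. To keep the example self-contained it uses a simulated data frame (toy, with hypothetical columns x1, x2, and grade) rather than the Boys Town data, and the seed is arbitrary.

```r
set.seed(2023)   # arbitrary seed, only to make the split reproducible

# simulated stand-in for a normalized data set with a class label
toy <- data.frame(
  x1    = runif(200),
  x2    = runif(200),
  grade = factor(sample(c("above_avg", "avg_or_below"), 200, replace = TRUE))
)

# draw 75% of the row indices at random for training
train_idx <- sample(nrow(toy), floor(0.75 * nrow(toy)))

toy_train        <- toy[train_idx, c("x1", "x2")]
toy_test         <- toy[-train_idx, c("x1", "x2")]
toy_train_labels <- toy$grade[train_idx]
toy_test_labels  <- toy$grade[-train_idx]
```

A random split guards against any ordering in the file (for example, records sorted by site or date) leaking into the train/test partition.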

7.3.5  Step  3:  Training  a  Model  On  the  Data 

We  are  using  the  class  package  for  the  kNN  algorithm  in  R. 


# install.packages('class', repos = "http://cran.us.r-project.org")
library(class)

The function knn() has the following components:

p<-knn(train, test, class, k)

•  train: data frame containing numeric training data (features)

•  test: data frame containing numeric testing data (features)

•  class/cl: class for each observation in the training data

•  k: predetermined integer indicating the number of nearest neighbors

The first k we choose is the square root of our number of observations:
sqrt(200) ≈ 14.

bt_test_pred<-knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=14)
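
To show what a kNN prediction does internally, the following from-scratch sketch classifies a single query point by Euclidean distance and majority vote. It is a simplified illustration, not the implementation used by class::knn(); the function name knn_one and the toy data are hypothetical, and ties are resolved by taking the first class rather than at random.

```r
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query point to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, query)^2))
  nearest <- order(d)[1:k]            # indices of the k closest rows
  votes <- table(labels[nearest])     # class counts among the neighbors
  names(votes)[which.max(votes)]      # majority class (first one on ties)
}

# two well-separated toy clusters
train  <- data.frame(x = c(0, 0.1, 0.2, 1, 0.9, 0.8),
                     y = c(0, 0.2, 0.1, 1, 0.8, 0.9))
labels <- factor(c("a", "a", "a", "b", "b", "b"))

pred_a <- knn_one(train, labels, query = c(0.15, 0.1), k = 3)
pred_b <- knn_one(train, labels, query = c(0.85, 0.9), k = 3)
```

Because no model is fitted in advance and all work happens at prediction time, kNN is called a lazy learner.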


7.3.6  Step  4:  Evaluating  Model  Performance 

We utilize the CrossTable() function, introduced in Chap. 3, to evaluate the kNN
model. We have two classes in this example. The goal is to create a 2 x 2 table that
shows the matched true and predicted classes, as well as the unmatched ones.
Chi-square values are not needed, so we use the option prop.chisq=FALSE to
suppress them.

# install.packages("gmodels", repos="http://cran.us.r-project.org")
library(gmodels)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.769 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |            9 |           11 |           20 |
##                |        0.450 |        0.550 |        0.400 |
##                |        0.231 |        1.000 |              |
##                |        0.180 |        0.220 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           39 |           11 |           50 |
##                |        0.780 |        0.220 |              |
## ---------------|--------------|--------------|--------------|


In the table, the diagonal cells (first row, first column and second row, second
column) contain the counts for records whose predicted classes match the true
classes. The other two cells are the counts for unmatched cases. The accuracy in
this case is calculated by:

$$accuracy = \frac{cell_{1,1} + cell_{2,2}}{total}$$

This accuracy may vary between runs, because knn() breaks ties at random.
In this situation, we got accuracy = (30 + 11)/50 = 0.82; however, a previous
run generated an accuracy of 0.76.
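
The accuracy calculation can be reproduced from the four counts in the cross table above. The matrix conf_tab below is typed in by hand from that output (rows are true classes, columns are predicted classes), so this is a check of the arithmetic rather than a rerun of the model.

```r
# counts from the k = 14 cross table; matrix() fills column by column
conf_tab <- matrix(c(30, 9, 0, 11), nrow = 2,
                   dimnames = list(actual    = c("above_avg", "avg_or_below"),
                                   predicted = c("above_avg", "avg_or_below")))

# matched cases sit on the diagonal
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
accuracy   # 0.82
```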


7.3.7  Step  5:  Improving  Model  Performance 


The above normalization may be suboptimal. We can try an alternative
standardization method, e.g., standard z-score centering and scaling (via the
scale() method). Let's give it a try:

bt_z<-as.data.frame(scale(boystown[, -11]))
summary(bt_z$Alcoholuse)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -2.04800 -0.98960  0.06879  0.00000  0.59800  3.77300

The summary() output shows the re-scaling is working properly. Then, we can
proceed to the next steps (retraining the kNN, predicting and assessing the accuracy of
the results):

bt_train<-bt_z[1:150, -11]
bt_test<-bt_z[151:200, -11]
bt_train_labels<-boystown[1:150, 11]
bt_test_labels<-boystown[151:200, 11]
bt_test_pred<-knn(train=bt_train, test=bt_test,
                  cl=bt_train_labels, k=14)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.769 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |            9 |           11 |           20 |
##                |        0.450 |        0.550 |        0.400 |
##                |        0.231 |        1.000 |              |
##                |        0.180 |        0.220 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           39 |           11 |           50 |
##                |        0.780 |        0.220 |              |
## ---------------|--------------|--------------|--------------|


Under  the  z-score  method,  the  prediction  result  is  similar  to  the  previous  run. 


7.3.8  Testing  Alternative  Values  of  k 


Originally, we used the square root of 200 as our k. However, this might not be the
best k in this case study. We can test different values of k for their predictive
performance.

bt_train<-boystown_n[1:150, -11]
bt_test<-boystown_n[151:200, -11]
bt_train_labels<-boystown[1:150, 11]
bt_test_labels<-boystown[151:200, 11]
bt_test_pred1<-knn(train=bt_train, test=bt_test,
                   cl=bt_train_labels, k=1)
bt_test_pred5<-knn(train=bt_train, test=bt_test,
                   cl=bt_train_labels, k=5)
bt_test_pred11<-knn(train=bt_train, test=bt_test,
                    cl=bt_train_labels, k=11)
bt_test_pred21<-knn(train=bt_train, test=bt_test,
                    cl=bt_train_labels, k=21)
bt_test_pred27<-knn(train=bt_train, test=bt_test,
                    cl=bt_train_labels, k=27)

ct_1<-CrossTable(x=bt_test_labels, y=bt_test_pred1, prop.chisq = F)

##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred1
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           27 |            3 |           30 |
##                |        0.900 |        0.100 |        0.600 |
##                |        0.818 |        0.176 |              |
##                |        0.540 |        0.060 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |            6 |           14 |           20 |
##                |        0.300 |        0.700 |        0.400 |
##                |        0.182 |        0.824 |              |
##                |        0.120 |        0.280 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           33 |           17 |           50 |
##                |        0.660 |        0.340 |              |
## ---------------|--------------|--------------|--------------|


ct_5<-CrossTable(x=bt_test_labels, y=bt_test_pred5, prop.chisq = F)

##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred5
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.857 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |            5 |           15 |           20 |
##                |        0.250 |        0.750 |        0.400 |
##                |        0.143 |        1.000 |              |
##                |        0.100 |        0.300 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           35 |           15 |           50 |
##                |        0.700 |        0.300 |              |
## ---------------|--------------|--------------|--------------|

ct_11<-CrossTable(x=bt_test_labels, y=bt_test_pred11, prop.chisq = F)

##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred11
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.769 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |            9 |           11 |           20 |
##                |        0.450 |        0.550 |        0.400 |
##                |        0.231 |        1.000 |              |
##                |        0.180 |        0.220 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           39 |           11 |           50 |
##                |        0.780 |        0.220 |              |
## ---------------|--------------|--------------|--------------|

ct_21<-CrossTable(x=bt_test_labels, y=bt_test_pred21, prop.chisq = F)

##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred21
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.714 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |           12 |            8 |           20 |
##                |        0.600 |        0.400 |        0.400 |
##                |        0.286 |        1.000 |              |
##                |        0.240 |        0.160 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           42 |            8 |           50 |
##                |        0.840 |        0.160 |              |
## ---------------|--------------|--------------|--------------|



ct_27<-CrossTable(x=bt_test_labels, y=bt_test_pred27, prop.chisq = F)

##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  50
##
##                | bt_test_pred27
## bt_test_labels |    above_avg | avg_or_below |    Row Total |
## ---------------|--------------|--------------|--------------|
##      above_avg |           30 |            0 |           30 |
##                |        1.000 |        0.000 |        0.600 |
##                |        0.682 |        0.000 |              |
##                |        0.600 |        0.000 |              |
## ---------------|--------------|--------------|--------------|
##   avg_or_below |           14 |            6 |           20 |
##                |        0.700 |        0.300 |        0.400 |
##                |        0.318 |        1.000 |              |
##                |        0.280 |        0.120 |              |
## ---------------|--------------|--------------|--------------|
##   Column Total |           44 |            6 |           50 |
##                |        0.880 |        0.120 |              |
## ---------------|--------------|--------------|--------------|

The choice of k in kNN classification is very important. The tune.knn() function
in the e1071 package can search a range of k values using cross-validation.

# install.packages("e1071")
library(e1071)
knntuning = tune.knn(x= bt_train, y = bt_train_labels, k = 1:30)
knntuning

##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  k
##  9
##
## - best performance: 0.1733333

summary(knntuning)

## Parameter tuning of 'knn.wrapper':
## - sampling method: 10-fold cross validation
## - best parameters:
##  k
##  9
## - best performance: 0.1733333
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.2400000 0.08432740
## 2   2 0.2800000 0.10795518
## 3   3 0.2133333 0.10327956
## 4   4 0.2466667 0.04499657
## 5   5 0.1866667 0.10327956
## 6   6 0.1866667 0.11243654
## 7   7 0.1800000 0.10446808
## 8   8 0.1866667 0.12090196
## 9   9 0.1733333 0.10976968
## 10 10 0.2133333 0.12090196
## 11 11 0.2266667 0.12649111
## 12 12 0.2066667 0.11088867
## 13 13 0.2133333 0.11243654
## 14 14 0.2266667 0.13033670
## 15 15 0.2133333 0.12090196
## 16 16 0.2133333 0.09838197
## 17 17 0.2200000 0.10909278
## 18 18 0.2266667 0.11842589
## 19 19 0.2200000 0.10909278
## 20 20 0.2333333 0.11439589
## 21 21 0.2333333 0.11439589
## 22 22 0.2200000 0.08916623
## 23 23 0.2533333 0.10327956
## 24 24 0.2466667 0.10446808
## 25 25 0.2466667 0.11779874
## 26 26 0.2600000 0.11088867
## 27 27 0.2533333 0.11674600
## 28 28 0.2666667 0.10886621
## 29 29 0.2866667 0.11352924
## 30 30 0.2800000 0.11674600


It’s  useful  to  visualize  the  error  rate  against  the  value  of  k.  This  can  help  us 
select  a  k  parameter  that  minimizes  the  cross-validation  (CV)  error  (Fig.  7.2). 

Fig. 7.2 Classification error plots (y-axis) for training data (red), internal statistical cross-validation
(green) and external out of box data (blue) against different k-parameters of the kNN method
(x-axis: number of nearest neighbors, k)


library(class)
library(ggplot2)

# define a function that generates CV folds
cv_partition <- function(y, num_folds = 10, seed = NULL) {
  if(!is.null(seed)) {
    set.seed(seed)
  }
  n <- length(y)
  folds <- split(sample(seq_len(n), n), gl(n = num_folds, k = 1, length = n))
  folds <- lapply(folds, function(fold) {
    list(
      training = which(!seq_along(y) %in% fold),
      test = fold
    )
  })
  names(folds) <- paste0("Fold", names(folds))
  return(folds)
}

# Generate 10-folds of the data
folds = cv_partition(bt_train_labels, num_folds = 10)

# Define a training set_CV_error calculation function
train_cv_error = function(K) {
  # Train error
  knnbt = knn(train = bt_train, test = bt_train,
              cl = bt_train_labels, k = K)
  train_error = mean(knnbt != bt_train_labels)

  # CV error
  cverrbt = sapply(folds, function(fold) {
    mean(bt_train_labels[fold$test] != knn(train=bt_train[fold$training, ],
      cl = bt_train_labels[fold$training], test = bt_train[fold$test, ], k=K))
  }
  )
  cv_error = mean(cverrbt)

  # Test error
  knn.test = knn(train = bt_train, test = bt_test,
                 cl = bt_train_labels, k = K)
  test_error = mean(knn.test != bt_test_labels)
  return(c(train_error, cv_error, test_error))
}

k_err = sapply(1:30, function(k) train_cv_error(k))
df_errs = data.frame(t(k_err), 1:30)
colnames(df_errs) = c('Train', 'CV', 'Test', 'K')

require(ggplot2)
library(reshape2)

dataL <- melt(df_errs, id="K")
ggplot(dataL, aes_string(x="K", y="value", colour="variable",
    group="variable", linetype="variable", shape="variable")) +
  geom_line(size=0.8) + labs(x = "Number of nearest neighbors ($k$)",
    y = "Classification error",
    colour="", group="",
    linetype="", shape="") +
  geom_point(size=2.8) +
  geom_vline(xintercept=4:5, colour = "pink")+
  geom_text(aes(4,0, label = "4", vjust = 1)) +
  geom_text(aes(5,0, label = "5", vjust = 1))


7.3.9  Quantitative  Assessment  ( Tables  7.2  and  7.3) 

The  reader  should  first  review  the  fundamentals  of  hypothesis  testing  inference. 
Table  7.2  shows  the  basic  components  of  binary  classification,  and  Table  7.3  reports 
the  results  of  the  classification  for  several  k  values. 


Table 7.2 Basic evaluation metrics of binary classification

                      Actual negative             Actual positive
kNN fails to reject   TN                          FN
kNN rejects           FP                          TP
Metrics               Specificity: TN/(TN + FP)   Sensitivity: TP/(TP + FN)

Table 7.3 Summary results of the kNN classification for different values of the parameter k

k value   Total unmatched counts   Accuracy
1         9                        0.82
5         5                        0.90
11        9                        0.82
21        12                       0.76
27        14                       0.72
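
The definitions in Table 7.2 and the k = 5 row of Table 7.3 can be checked with a few lines of R. The helper binary_metrics is a hypothetical name introduced for this sketch, and the counts are typed in from the k = 5 cross table, treating "avg_or_below" as the positive class (an assumption about which class is of interest).

```r
binary_metrics <- function(TN, FP, FN, TP) {
  c(accuracy    = (TP + TN) / (TP + TN + FP + FN),
    sensitivity = TP / (TP + FN),    # share of actual positives detected
    specificity = TN / (TN + FP))    # share of actual negatives retained
}

# k = 5: 30 above_avg correct, 0 above_avg misclassified,
#        5 avg_or_below missed, 15 avg_or_below correct
m5 <- binary_metrics(TN = 30, FP = 0, FN = 5, TP = 15)
m5
```

The accuracy of 0.90 agrees with Table 7.3, since 5 unmatched cases out of 50 leave 45 matched.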

Suppose we want to evaluate the kNN model with k = 5 (model5) as to how well it
predicts the below-average boys. Let's manually report some of the accuracy metrics
for model5, treating "avg_or_below" as the positive class. Combining the results, we
get the following sensitivity and specificity:

# bt_test_pred5<-knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=5)
# ct_5<-CrossTable(x=bt_test_labels, y=bt_test_pred5, prop.chisq = F)
# row proportions: rows are true classes, columns are predicted classes
mod5_TN <- ct_5$prop.row[1, 1]
mod5_FP <- ct_5$prop.row[1, 2]
mod5_FN <- ct_5$prop.row[2, 1]
mod5_TP <- ct_5$prop.row[2, 2]

# sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), as in Table 7.2
mod5_sensi <- mod5_TP/(mod5_TP+mod5_FN)
mod5_speci <- mod5_TN/(mod5_TN+mod5_FP)
print(paste0("mod5_sensi=", mod5_sensi))

## [1] "mod5_sensi=0.75"

print(paste0("mod5_speci=", mod5_speci))

## [1] "mod5_speci=1"

Therefore, k = 5 is a good choice for the number of nearest neighbors.
Nevertheless, we can always examine values near 5 to get potentially better choices
of k.

Another  strategy  for  model  validation  and  improvement  involves  the  use  of  the 
confusionMatrix  ( )  method,  which  reports  several  complementary  metrics 
quantifying  the  performance  of  the  prediction  model. 

Let's focus on the power of model5 to predict the grade category (above average
vs. average or below).

corr5 <- cor(as.numeric(bt_test_labels), as.numeric(bt_test_pred5))
corr5

## [1] 0.8017837

# plot(as.numeric(bt_test_labels), as.numeric(bt_test_pred5))
# install.packages("caret")
library("caret")

## Loading required package: lattice

# compute the accuracy, LOR, sensitivity/specificity of 3 kNN models

# Model 1: bt_test_pred1
confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred1))

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  1  2
##          1 27  3
##          2  6 14
##
##                Accuracy : 0.82
##                  95% CI : (0.6856, 0.9142)
##     No Information Rate : 0.66
##     P-Value [Acc > NIR] : 0.009886
##
##                   Kappa : 0.6154
##  Mcnemar's Test P-Value : 0.504985
##
##             Sensitivity : 0.8182
##             Specificity : 0.8235
##          Pos Pred Value : 0.9000
##          Neg Pred Value : 0.7000
##              Prevalence : 0.6600
##          Detection Rate : 0.5400
##    Detection Prevalence : 0.6000
##       Balanced Accuracy : 0.8209
##
##        'Positive' Class : 1
##

# Model 5: bt_test_pred5
confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred5))

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  1  2
##          1 30  0
##          2  5 15
##
##                Accuracy : 0.9
##                  95% CI : (0.7819, 0.9667)
##     No Information Rate : 0.7
##     P-Value [Acc > NIR] : 0.0007229
##
##                   Kappa : 0.7826
##  Mcnemar's Test P-Value : 0.0736383
##
##             Sensitivity : 0.8571
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 0.7500
##              Prevalence : 0.7000
##          Detection Rate : 0.6000
##    Detection Prevalence : 0.6000
##       Balanced Accuracy : 0.9286
##
##        'Positive' Class : 1
##

# Model 11: bt_test_pred11
confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred11))

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  1  2
##          1 30  0
##          2  9 11
##
##                Accuracy : 0.82
##                  95% CI : (0.6856, 0.9142)
##     No Information Rate : 0.78
##     P-Value [Acc > NIR] : 0.313048
##
##                   Kappa : 0.5946
##  Mcnemar's Test P-Value : 0.007661
##
##             Sensitivity : 0.7692
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 0.5500
##              Prevalence : 0.7800
##          Detection Rate : 0.6000
##    Detection Prevalence : 0.6000
##       Balanced Accuracy : 0.8846
##
##        'Positive' Class : 1
##



Finally,  we  can  use  a  3D  plot  to  display  the  results  of  model5  (mod5_TN, 
mod5_FN,  mod5_FP,  mod5_TP),  Fig.  7.3. 


# install.packages("scatterplot3d")
library(scatterplot3d)
grid_xy <- matrix(c(0, 1, 1, 0), nrow=2, ncol=2)
intensity <- matrix(c(mod5_TN, mod5_FN, mod5_FP, mod5_TP), nrow=2, ncol=2)

# scatterplot3d(grid_xy, intensity, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot")

s3d.dat <- data.frame(cols=as.vector(col(grid_xy)),
                      rows=as.vector(row(grid_xy)),
                      value=as.vector(intensity))
scatterplot3d(s3d.dat, pch=16, highlight.3d=TRUE, type="h", xlab="real",
  ylab="predicted", zlab="Agreement",
  main="3D Scatterplot: Model5 Results (FP, FN, TP, TN)")

Fig. 7.3 5-NN classification metrics

# scatterplot3d(s3d.dat, type="h", lwd=5, pch=" ", xlab="real", ylab="predicted",
#   zlab="Agreement", main="Model5 Results (FP, FN, TP, TN)")


7.4  Assignments:  7.  Lazy  Learning:  Classification  Using 
Nearest  Neighbors 

7.4.1  Traumatic  Brain  Injury  (TBI) 

Use the kNN algorithm to provide a classification of the data in the TBI case study
(CaseStudy11_TBI). Determine an appropriate k, train, and evaluate the performance
of the classification model on the data. Report some model quality statistics
for a couple of different values of k and use these to rank-order (and perhaps plot the
classification results of) the models.


7.4.2 Parkinson's Disease

Use the 05_PPMI_top_UPDRS_Integrated_LongFormat1 data to practice kNN
classification.



7.4.3  KNN  Classification  in  a  High  Dimensional  Space 

•  Preprocess  the  data:  delete  the  index  and  ID  columns;  convert  the  response 
variable  ResearchGroup  to  binary  0-1  factor;  detect  NA  (missing)  values 
(impute  if  necessary) 

•  Summarize  the  dataset:  use  str,  summary,  cor,  ggpairs 

•  Scale/Normalize  the  data:  As  appropriate,  scale  to  [0,  1];  transform  log(x  +  1); 
discretize  (0  or  1),  etc. 

•  Partition data into training and testing sets: use set.seed and random
sample, train:test = 2:1

•  Select the optimal k for each of the scaled data: Plot an error graph for k,
including three lines: training error, cross-validation error, and testing error,
respectively

•  What is the impact of k? Formulate a hypothesis about the relation between
k and the error rates. You can try to use tune.knn to verify the results
(Hint: select the same folds, otherwise you may obtain a slightly different result)

•  Interpret the results: Hint: Considering the number of dimensions of the data,
how many points are necessary to obtain the same density result for a
100-dimensional space compared to a 1-dimensional space?

•  Report  the  error  rates  for  both  the  training  and  the  testing  data.  What  do  you 
find? 


7.4.4  KNN  Classification  in  a  Lower  Dimensional  Space 

Try all the above again but select only the following variables as predictors:

•  UPDRS_Part_I_Summary_Score_Baseline

•  UPDRS_Part_I_Summary_Score_Month_24

•  UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline

•  UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24

•  UPDRS_Part_III_Summary_Score_Baseline

•  UPDRS_Part_III_Summary_Score_Month_24

Now, what about the specific k you select and the error rates for each kind of data
(original data, normalized data, log-transformed data, and binary data)? Comment on
any interesting observations.

Chapter 8

Probabilistic Learning: Classification Using Naive Bayes


The  introduction  to  Chap.  7  presented  the  types  of  machine  learning  methods  and 
described  lazy  classification  for  numerical  data.  What  about  nominal  features  or 
textual  data?  In  this  Chapter,  we  will  begin  to  explore  some  classification  techniques 
for  categorical  data.  Specifically,  we  will  (1)  present  the  Naive  Bayes  algorithm; 
(2)  review  its  assumptions;  (3)  discuss  Laplace  estimation;  and  (4)  illustrate  the 
Naive  Bayesian  classifier  on  a  Head  and  Neck  Cancer  Medication  case-study. 

Later,  in  Chap.  20,  we  will  also  discuss  text  mining  and  natural  language 
processing  of  unstructured  text  data. 


8.1  Overview  of  the  Naive  Bayes  Algorithm 

Start  by  reviewing  the  basics  of  probability  theory  and  Bayesian  inference. 

Bayes  classifiers  use  training  data  to  calculate  an  observed  probability  of  each 
class  based  on  all  the  features.  The  probability  links  feature  values  to  classes  like  a 
map.  When  labeling  the  test  data,  we  utilize  the  feature  values  in  the  test  data  and  the 
“map”  to  classify  our  test  data  with  the  most  likely  class.  This  idea  seems  simple  but 
the  corresponding  algorithmic  implementations  might  be  very  sophisticated. 

The best scenario for accurately estimating the probability of an outcome class is when all features in the Bayes classifier contribute to the class simultaneously. The Naive Bayes algorithm is frequently used for text classification. The maximum a posteriori assignment of the class label is based on obtaining the conditional probability density function for each feature given the value of the class variable.


©  Ivo  D.  Dinov  2018 

I.  D.  Dinov,  Data  Science  and  Predictive  Analytics, 
https://doi.org/10.1007/978-3-319-72347-1_8




8.2  Assumptions 

Naive  Bayes  is  named  for  its  “naive”  assumptions.  Its  most  important  assumption  is 
that  all  of  the  features  are  equally  important  and  independent.  This  rarely  happens  in 
real  world  data.  However,  sometimes  even  when  the  assumptions  are  violated,  Naive 
Bayes  still  performs  fairly  accurately,  particularly  when  the  number  of  features  p  is 
large.  This  is  why  the  Naive  Bayes  algorithm  may  be  used  as  a  powerful  text 
classifier. 

There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis), and Naive Bayes classification. Additional information about LDA and QDA is available online (http://wiki.socr.umich.edu/index.php/SMHS_BigDataBigSci_CrossVal_LDA_QDA).


8.3  Bayes  Formula 

Let's first define the set-theoretic Bayes formula. We assume that the $B_i$'s are mutually exclusive events, for all $i = 1, 2, \ldots, n$, where $n$ represents the number of features. If $A$ and $B$ are two events, the Bayes conditional probability formula is as follows:

$$\text{Posterior Probability} = \frac{\text{Likelihood} \times \text{Prior Probability}}{\text{Marginal Likelihood}}.$$

Symbolically,

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$

When the $B_i$'s represent a partition of the event space, $S = \cup B_i$ and $B_i \cap B_j = \emptyset$, $\forall i \neq j$. So we have:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B \mid B_1) \times P(B_1) + P(B \mid B_2) \times P(B_2) + \cdots + P(B \mid B_n) \times P(B_n)}.$$

Now, let's represent the Bayes formula in terms of classification using observed features. Suppose we have observed $n$ features, $F_i$, for each of $K$ possible class outcomes, $C_k$. The Bayesian model may be reformulated to make it more tractable using Bayes' theorem, by decomposing the conditional probability:

$$P(C_k \mid F_1, \ldots, F_n) = \frac{P(F_1, \ldots, F_n \mid C_k)\, P(C_k)}{P(F_1, \ldots, F_n)}.$$

In the above expression, only the numerator depends on the class label, $C_k$, as the values of the features $F_i$ are observed (or imputed), making the denominator constant. Let's focus on the numerator.

The numerator essentially represents the joint probability model:

$$P(F_1, \ldots, F_n \mid C_k)\, P(C_k) = \underbrace{P(F_1, \ldots, F_n, C_k)}_{\text{joint model}}.$$

Repeatedly using the chain rule and the definition of conditional probability simplifies this to:

$$\begin{aligned}
P(F_1, \ldots, F_n, C_k) &= P(F_1 \mid F_2, \ldots, F_n, C_k) \times P(F_2, \ldots, F_n, C_k) \\
&= P(F_1 \mid F_2, \ldots, F_n, C_k) \times P(F_2 \mid F_3, \ldots, F_n, C_k) \times P(F_3, \ldots, F_n, C_k) \\
&= P(F_1 \mid F_2, \ldots, F_n, C_k) \times P(F_2 \mid F_3, \ldots, F_n, C_k) \times P(F_3 \mid F_4, \ldots, F_n, C_k) \times P(F_4, \ldots, F_n, C_k) \\
&= \cdots \\
&= P(F_1 \mid F_2, \ldots, F_n, C_k) \times P(F_2 \mid F_3, \ldots, F_n, C_k) \times P(F_3 \mid F_4, \ldots, F_n, C_k) \times \cdots \times P(F_n \mid C_k) \times P(C_k).
\end{aligned}$$

Note that the "naive" qualifier in the Naive Bayes classifier name is attributed to the oversimplification of the conditional probability. Assuming each feature $F_i$ is conditionally statistically independent of every other feature $F_j$, $\forall j \neq i$, given the category $C_k$, we get:

$$P(F_i \mid F_{i+1}, \ldots, F_n, C_k) = P(F_i \mid C_k).$$

This reduces the joint probability model to:

$$P(F_1, \ldots, F_n, C_k) = P(F_1 \mid C_k) \times P(F_2 \mid C_k) \times P(F_3 \mid C_k) \times \cdots \times P(F_n \mid C_k) \times P(C_k).$$

Therefore, the joint model is:

$$P(F_1, \ldots, F_n, C_k) = P(C_k) \prod_{i=1}^{n} P(F_i \mid C_k).$$

Essentially, we express the probability of class level $L$ given an observation, represented as a set of independent features $F_1, F_2, \ldots, F_n$. Then the posterior probability that the observation is in class $L$ is equal to:

$$P(C_L \mid F_1, \ldots, F_n) = \frac{P(C_L) \prod_{i=1}^{n} P(F_i \mid C_L)}{\prod_{i=1}^{n} P(F_i)},$$

where the denominator, $\prod_{i=1}^{n} P(F_i)$, is a scaling factor that represents the marginal probability of observing all features jointly.

For a given case $X = (F_1, F_2, \ldots, F_n)$, i.e., a given vector of features, the naive Bayes classifier assigns the most likely class $\hat{C}$ by calculating $\frac{P(C_L) \prod_{i=1}^{n} P(F_i \mid C_L)}{\prod_{i=1}^{n} P(F_i)}$ for all class labels $L$, and then assigning the class $\hat{C}$ corresponding to the maximum posterior probability. Analytically, $\hat{C}$ is defined by:

$$\hat{C} = \arg\max_{L} \frac{P(C_L) \prod_{i=1}^{n} P(F_i \mid C_L)}{\prod_{i=1}^{n} P(F_i)}.$$

As the denominator is static for $L$, the posterior probability above is maximized when the numerator is maximized, i.e., $\hat{C} = \arg\max_{L} P(C_L) \prod_{i=1}^{n} P(F_i \mid C_L)$.
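The maximization above can be traced by hand on a very small example. The following is a minimal sketch in base R, assuming a made-up training set of six cases with two binary features (`pain`, `oral`) and two classes; none of these values come from the case-study data. The scores are normalized over the classes only to display them as probabilities, which does not change the arg max.

```r
# Hypothetical toy training data: two binary word indicators and a class label
train <- data.frame(
  pain  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  oral  = c("No", "Yes", "Yes", "Yes", "No", "Yes"),
  stage = c("later", "later", "early", "early", "later", "early")
)
features <- c("pain", "oral")

# Prior P(C_L): relative class frequencies in the training data
prior <- prop.table(table(train$stage))

# Numerator of the posterior, P(C_L) * prod_i P(F_i | C_L), for one case
nb_score <- function(case, cls) {
  in_class <- train[train$stage == cls, ]
  cond <- sapply(features, function(f) mean(in_class[[f]] == case[[f]]))
  as.numeric(prior[cls]) * prod(cond)
}

new_case  <- list(pain = "Yes", oral = "No")
scores    <- sapply(names(prior), nb_score, case = new_case)
posterior <- scores / sum(scores)        # rescale so the class scores sum to 1
predicted <- names(which.max(scores))    # arg max over the class labels
print(posterior)
print(predicted)
```

In this toy set no early-stage case has `pain = "Yes"`, so the early-stage score is exactly zero regardless of the other feature.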

The contingency table below illustrates schematically how the Bayesian, marginal, conditional, and joint probabilities may be calculated for a finite number of features (columns) and classes (rows).

Features/Classes | $F_1$ | $F_2$ | ... | $F_n$ | Total
-----------------|-------|-------|-----|-------|------
$C_1$ | ... | ... | ... | ... | Marginal $P(C_1)$
$C_2$ | ... | ... | ... | Joint $P(C_2, F_n)$ | ...
... | ... | ... | ... | ... | ...
$C_L$ | Conditional $P(F_1 \mid C_L) = \frac{P(F_1, C_L)}{P(C_L)}$ | ... | ... | ... | ...
Total | ... | Marginal $P(F_2)$ | ... | ... | $N$

In  the  DSPA  Appendix,  we  provide  additional  technical  details,  code,  and 
applications  of  Bayesian  simulation,  modeling  and  inference. 


8.4  The  Laplace  Estimator 

If at least one $P(F_i \mid C_L) = 0$, then $P(C_L \mid F_1, \ldots, F_n) = 0$, which means the probability of being in this class is zero. However, $P(F_i \mid C_L) = 0$ could result from random chance in picking the training data.

One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number (typically 1) to each cell in the frequency table, so that each class-feature combination appears at least once in the training data. Then $P(F_i \mid C_L) > 0$ for all $i$. Another strategy is to add some small value, $\epsilon$, to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid a trivial (0) posterior for another class.
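The first strategy can be illustrated on a single frequency table. This is a minimal base R sketch with made-up counts (one binary feature, two classes); the variable names are hypothetical. Adding the constant to every cell increases each row total by the constant times the number of feature levels, so the smoothed conditional probabilities still sum to one within each class.

```r
# Hypothetical frequency table: rows are classes, columns are levels of one feature
counts <- matrix(c(30, 0,
                   12, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(class = c("early", "later"),
                                 word  = c("No", "Yes")))

# Unsmoothed conditional probabilities P(F = level | class); note the zero cell
p_raw <- prop.table(counts, margin = 1)

# Laplace (add-one) smoothing: add `laplace` to every cell of the table
laplace  <- 1
p_smooth <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

print(p_raw)
print(p_smooth)
```

With these counts the zero estimate for the early class becomes 1/32, which keeps the product of conditional probabilities from collapsing to zero.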


8.5  Case  Study:  Head  and  Neck  Cancer  Medication 
8.5.1  Step  1:  Collecting  Data 

We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is Case Study 14 in our data archive.

Variables: 

•  PID:  coded  patient  ID. 

•  ENC_ID:  coded  encounter  ID. 

•  Seer_stage: SEER cancer stage (0 = In situ, 1 = Localized, 2 = Regional by direct extension, 3 = Regional to lymph nodes, 4 = Regional (both codes 2 and 3), 5 = Regional, NOS, 7 = Distant metastases/systemic disease, 8 = Not applicable, 9 = Unstaged, unknown, or unspecified). See: http://seer.cancer.gov/tools/ssm.

•  Medication_desc:  description  of  the  chemical  composition  of  the  medication. 

•  Medication_summary:  brief  description  about  medication  brand  and  usage. 

•  Dose:  the  dosage  in  the  medication  summary. 

•  Unit:  the  unit  for  dosage  in  the  Medication_summary. 

•  Frequency:  the  frequency  of  use  in  the  Medication_summary. 

•  Total_dose_count:  total  dosage  count  according  to  the  Medication_summary. 


8.5.2  Step  2:  Exploring  and  Preparing  the  Data 

Let’s  load  our  data  first. 


hn_med <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)

## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 1.5 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...
Change the seer_stage (cancer stage indicator) variable into a factor.

hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)

##  Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...

table(hn_med$seer_stage)

##
##   0   1   2   3   4   5   7   8   9
##  21 265  53  90  46  18  87  14  68


Data  Preparation:  Processing  Text  Data  for  Analysis 

As you can see, the MEDICATION_SUMMARY variable contains a great amount of text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.

# install.packages("tm", repos = "http://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)

The first step for text mining is to convert text features (text elements) into a corpus object, which is a collection of text documents.

hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)

After we construct the corpus object, we can see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.

inspect(hn_med_corpus[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
##
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day
## [3] (Unasyn) 1.5 g IV every 6 hours

hn_med_corpus[[1]]$content

## [1] "(Zantac) 150 mg tablet oral two times a day"

hn_med_corpus[[2]]$content

## [1] "5,000 unit subcutaneous three times a day"

hn_med_corpus[[3]]$content

## [1] "(Unasyn) 1.5 g IV every 6 hours"
There are unwanted punctuation and other symbols in the corpus documents that we want to remove. We use the tm::tm_map() function for the cleaning.

corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

The above lines of code changed all the characters to lower case, removed all punctuation and extra white space (typically created by deleting punctuation), and removed numbers (we could also convert the corpus to plain text).

inspect(corpus_clean[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
##
## [1] zantac mg tablet oral two times a day
## [2]  unit subcutaneous three times a day
## [3] unasyn g iv every hours

corpus_clean[[1]]$content

## [1] "zantac mg tablet oral two times a day"

corpus_clean[[2]]$content

## [1] " unit subcutaneous three times a day"

corpus_clean[[3]]$content

## [1] "unasyn g iv every hours"

The DocumentTermMatrix() function can tokenize the medication summary into words. It can count frequent terms in each document in the corpus object.

hn_med_dtm <- DocumentTermMatrix(corpus_clean)
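To make the structure of a document term matrix concrete, here is a minimal base R sketch that builds one by hand for three short, made-up documents. It does not use the tm package. Dropping tokens shorter than three characters is an assumption chosen to mimic tm's default minimum word length, and all variable names are hypothetical.

```r
# Hypothetical cleaned documents
docs <- c("zantac mg tablet oral two times a day",
          "unit subcutaneous three times a day",
          "unasyn g iv every hours")

# Tokenize on whitespace and drop tokens shorter than 3 characters
tokens <- lapply(strsplit(docs, "\\s+"), function(w) w[nchar(w) >= 3])
vocab  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(dtm) <- seq_along(docs)
print(dtm)
```

Each row describes one document by the counts of the vocabulary terms it contains, which is the representation the classifier works with later.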


Data  Preparation:  Creating  Training  and  Test  Datasets 

Just  like  in  Chap.  7,  we  need  to  separate  the  dataset  into  training  and  test  subsets.  We 
have  to  subset  the  raw  data  with  other  features,  the  corpus  object  and  the  document 
term  matrix. 
set.seed(12)
# 80% training + 20% testing
subset_int <- sample(nrow(hn_med), floor(nrow(hn_med) * 0.8))
hn_med_train <- hn_med[subset_int, ]
hn_med_test <- hn_med[-subset_int, ]
hn_med_dtm_train <- hn_med_dtm[subset_int, ]
hn_med_dtm_test <- hn_med_dtm[-subset_int, ]
corpus_train <- corpus_clean[subset_int]
corpus_test <- corpus_clean[-subset_int]

# Alternative: a fixed (non-random) split
# hn_med_train <- hn_med[1:562, ]
# hn_med_test <- hn_med[563:662, ]
# hn_med_dtm_train <- hn_med_dtm[1:562, ]
# hn_med_dtm_test <- hn_med_dtm[563:662, ]
# corpus_train <- corpus_clean[1:562]
# corpus_test <- corpus_clean[563:662]
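The indexing pattern used for the split can be checked on a small example. The sketch below uses a made-up ten-row data frame (`toy` is a hypothetical name): `sample()` draws the training row indices, and negative indexing selects the remaining rows for testing, so the two subsets never overlap.

```r
set.seed(12)
toy <- data.frame(id = 1:10, label = rep(c("a", "b"), 5))   # hypothetical data

# 80% of the row indices, drawn at random without replacement
train_idx <- sample(nrow(toy), floor(nrow(toy) * 0.8))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]    # all rows that were not sampled

print(dim(toy_train))
print(dim(toy_test))
```

Applying the same index vector to the raw data, the document term matrix, and the corpus keeps the three objects aligned row by row.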

Let’s  examine  the  distribution  of  seer  stages  in  the  training  and  test  datasets. 

prop.table(table(hn_med_train$seer_stage))

##
##          0          1          2          3          4          5
## 0.03024575 0.38374291 0.08317580 0.14555766 0.06616257 0.03402647
##          7          8          9
## 0.13421550 0.02268431 0.10018904

prop.table(table(hn_med_test$seer_stage))

##
##          0          1          2          3          4          5
## 0.03759398 0.46616541 0.06766917 0.09774436 0.08270677 0.00000000
##          7          8          9
## 0.12030075 0.01503759 0.11278195

We can separate (dichotomize) the seer_stage into two categories:

•  no stage or early-stage cancer (seer_stage 0, 1, 2, 3, 8, or 9), and

•  later-stage cancer (seer_stage 4, 5, or 7).

hn_med_train$stage <- hn_med_train$seer_stage %in% c(4, 5, 7)
hn_med_train$stage <- factor(hn_med_train$stage, levels = c(FALSE, TRUE), labels = c("early_stage", "later_stage"))
hn_med_test$stage <- hn_med_test$seer_stage %in% c(4, 5, 7)
hn_med_test$stage <- factor(hn_med_test$stage, levels = c(FALSE, TRUE), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))

## early_stage later_stage
##   0.7655955   0.2344045

prop.table(table(hn_med_test$stage))

## early_stage later_stage
##   0.7969925   0.2030075
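The two-step recoding above (a logical membership test followed by a labeled factor) can be verified on a short vector. This is a small sketch with made-up stage codes; `seer` and `is_later` are hypothetical names.

```r
seer <- factor(c(1, 4, 9, 7, 0, 5))            # hypothetical SEER stage codes

# Step 1: TRUE when the stage code is one of the later-stage codes
is_later <- seer %in% c(4, 5, 7)

# Step 2: turn the logical indicator into a labeled two-level factor
stage <- factor(is_later, levels = c(FALSE, TRUE),
                labels = c("early_stage", "later_stage"))
print(table(stage))
```

Because `%in%` compares values as character strings when one side is a factor, the test works even after seer_stage has been converted to a factor.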


Visualizing  Text  Data:  Word  Clouds 

A  word  cloud  can  help  us  visualize  text  data.  More  frequent  words  would  have  larger 
fonts  in  the  figure,  while  less  common  words  appear  in  smaller  fonts.  There  is  a 
wordcloud  package  in  R  that  is  commonly  used  for  creating  these  figures 
(Figs.  8.1,  8.2,  8.3). 


# install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(wordcloud)
wordcloud(corpus_train, min.freq = 40, random.order = FALSE)

The random.order=FALSE option places the more frequent words in the middle. The min.freq=40 option sets the cutoff word frequency to at least 40 occurrences in the corpus object. Therefore, a word must appear at least 40 times across the medication summaries to be shown on the graph.

We can also visualize the difference between early stages and later stages using this type of graph (Figs. 8.2 and 8.3).


Fig. 8.1  A wordle diagram representing the common terms (frequency exceeding 40) in the head and neck (H&N) text corpus

Fig. 8.2  Wordle plot of the common terms included in the medical treatment summary corpus of the early stage cancer patients

Fig. 8.3  Wordle plot of the common terms included in the medical treatment summary corpus of the later stage cancer patients (compare to Fig. 8.2)
early <- subset(hn_med_train, stage == "early_stage")
later <- subset(hn_med_train, stage == "later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20)
wordcloud(later$MEDICATION_SUMMARY, max.words = 20)

We  can  see  that  the  frequent  words  are  somewhat  different  in  the  medication 
summary  between  early  stage  and  later  stage  patients. 


Data  Preparation:  Creating  Indicator  Features  for  Frequent  Words 

For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that in Chap. 7 we used numeric features for classification. In this study, we are going to make the frequencies of words into features.

summary(findFreqTerms(hn_med_dtm_train, 5))

##    Length     Class      Mode
##       103 character character

hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary = hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary = hn_med_dict))

The above code limits the document term matrix to words that occur at least five times in the training corpus. This created 103 features for us to use.
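The frequency filter can be mimicked in base R on a small count matrix. The sketch below is an illustration under assumptions: `toy_dtm` is a made-up matrix, and the threshold is applied to the total count of each term over all documents, which is how the lower frequency bound is treated here.

```r
# Hypothetical document-term counts (rows: documents, columns: terms)
toy_dtm <- matrix(c(2, 0, 1,
                    3, 1, 0,
                    1, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(NULL, c("oral", "pacu", "tablet")))

term_totals <- colSums(toy_dtm)                    # total count of each term
dict <- names(term_totals[term_totals >= 5])       # terms meeting the threshold
toy_reduced <- toy_dtm[, dict, drop = FALSE]       # keep only dictionary terms
print(dict)
```

Restricting both the training and the test matrices to the same dictionary guarantees that they share identical columns.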

The  Naive  Bayes  classifier  trains  on  data  with  categorical  features.  Thus,  we  need 
to  transform  our  word  count  features  into  categorical  data.  A  way  to  do  this  is  to 
change  the  count  into  an  indicator  of  whether  this  word  appears.  We  can  create  a 
function  of  our  own  to  deal  with  this. 


convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}

An important statement is x <- ifelse(x > 0, 1, 0). This is saying that if we have an x that is greater than 0, we assign value 1 to it, otherwise the value is set to 0.

Now let's apply our own function convert_counts() on each column (MARGIN=2) of the training and testing datasets.

hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)

So far, we successfully created indicators for words that occurred at least five times in the training data.
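The effect of applying the conversion column by column can be inspected on a small matrix. This sketch uses made-up counts (`toy_counts` is a hypothetical name). One detail worth knowing: `apply()` coerces factor results with `as.vector()`, so the returned object is a character matrix of "No"/"Yes" strings rather than a set of factors.

```r
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)                                  # count -> 0/1
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))      # 0/1 -> No/Yes
}

# Hypothetical word counts (rows: documents, columns: terms)
toy_counts <- matrix(c(0, 2,
                       1, 0,
                       3, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("oral", "tablet")))

toy_ind <- apply(toy_counts, MARGIN = 2, convert_counts)
print(toy_ind)
print(class(toy_ind[, "oral"]))
```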


8.5.3  Step  3:  Training  a  Model  on  the  Data 

The package we will use for the Naive Bayes classifier is called e1071.

# install.packages("e1071", repos = "http://cran.us.r-project.org")
library(e1071)

The function naiveBayes() has the following components:

m <- naiveBayes(train, class, laplace = 0)

•  train: data frame or matrix containing the training data (categorical and/or numeric features).

•  class: factor vector with the class for each row in the training data.

•  laplace: positive double controlling Laplace smoothing; the default is zero, which disables Laplace smoothing.
Let's build our classifier first.

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

Then, we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chap. 3, we saw the basic mechanism of machine-learning training, prediction and assessment.

The function predict() has the following components:

p <- predict(m, test, type = "class")

•  m: classifier trained by naiveBayes().

•  test: test data frame or matrix.

•  type: either "class" or "raw"; specifies whether the predictions should be the most likely class value or the raw predicted probabilities.

hn_test_pred <- predict(hn_classifier, hn_test)
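The difference between the two prediction types can be seen on a small example. The sketch below assumes the e1071 package is installed (the code is skipped otherwise) and uses a made-up training set of six cases with two "No"/"Yes" indicators; all object names are hypothetical and the numbers are unrelated to the case-study data.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  lv <- c("No", "Yes")
  # Hypothetical training features and class labels
  toy_x <- data.frame(
    pain = factor(c("Yes", "Yes", "No", "No", "Yes", "No"), levels = lv),
    oral = factor(c("No", "Yes", "Yes", "Yes", "No", "Yes"), levels = lv)
  )
  toy_y <- factor(c("later", "later", "early", "early", "later", "early"))

  # Train with add-one Laplace smoothing
  m <- e1071::naiveBayes(toy_x, toy_y, laplace = 1)

  # Two hypothetical new cases
  new_x <- data.frame(pain = factor(c("Yes", "No"), levels = lv),
                      oral = factor(c("No", "Yes"), levels = lv))

  pred_class <- predict(m, new_x, type = "class")   # most likely class labels
  pred_raw   <- predict(m, new_x, type = "raw")     # posterior probabilities
  print(pred_class)
  print(pred_raw)
}
```

The "raw" output has one column per class, and each row sums to one; the "class" output is simply the column with the largest probability in each row.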


8.5.4  Step  4:  Evaluating  Model  Performance 


Similarly to the approach in Chap. 7, we use a cross table to compare the predicted class and the true class of our test dataset.

library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)

##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  133
##
##              | hn_med_test$stage
## hn_test_pred | early_stage | later_stage |   Row Total |
## -------------|-------------|-------------|-------------|
##  early_stage |          90 |          24 |         114 |
##              |       0.008 |       0.032 |             |
##              |       0.789 |       0.211 |       0.857 |
##              |       0.849 |       0.889 |             |
##              |       0.677 |       0.180 |             |
## -------------|-------------|-------------|-------------|
##  later_stage |          16 |           3 |          19 |
##              |       0.049 |       0.190 |             |
##              |       0.842 |       0.158 |       0.143 |
##              |       0.151 |       0.111 |             |
##              |       0.120 |       0.023 |             |
## -------------|-------------|-------------|-------------|
## Column Total |         106 |          27 |         133 |
##              |       0.797 |       0.203 |             |
## -------------|-------------|-------------|-------------|
It may be worth skipping forward to Chap. 14, where we present a summary table for the key measures used to evaluate the performance of binary tests, classifiers, or predictions.

The prediction accuracy:

$$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{93}{133} = 0.70.$$

From the cross table we can see that our prediction accuracy is $\frac{93}{133} = 0.70$. However, only three later-stage cases are correctly classified. This might be due to the $P(F_i \mid C_L) \approx 0$ problem that we discussed above.
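The accuracy calculation can be reproduced directly from the four cell counts of the cross table. The sketch below types those counts into a matrix by hand; the object names are hypothetical, and treating "later_stage" as the class of interest for the second measure is an assumption made for illustration.

```r
# Counts copied from the cross table (rows: predicted, columns: true class)
conf <- matrix(c(90, 24,
                 16,  3),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("early_stage", "later_stage"),
                               actual    = c("early_stage", "later_stage")))

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(conf)) / sum(conf)

# Share of true later-stage patients that were predicted as later stage
later_detected <- conf["later_stage", "later_stage"] / sum(conf[, "later_stage"])

print(round(accuracy, 3))
print(round(later_detected, 3))
```

The second quantity makes the weakness of the model explicit: the overall accuracy is driven almost entirely by the early-stage class.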


8.5.5  Step  5:  Improving  Model  Performance 


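After setting laplace=15, the accuracy goes up to 76%. This is a small improvement in overall accuracy. Note, however, that the cross table below shows only two of the 27 later-stage patients correctly identified, so the gain comes from classifying more early-stage patients correctly.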


hn_classifier <- naiveBayes(hn_train, hn_med_train$stage, laplace = 15)
hn_test_pred <- predict(hn_classifier, hn_test)
CrossTable(hn_test_pred, hn_med_test$stage)

##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  133
##
##              | hn_med_test$stage
## hn_test_pred | early_stage | later_stage |   Row Total |
## -------------|-------------|-------------|-------------|
##  early_stage |          99 |          25 |         124 |
##              |       0.000 |       0.001 |             |
##              |       0.798 |       0.202 |       0.932 |
##              |       0.934 |       0.926 |             |
##              |       0.744 |       0.188 |             |
## -------------|-------------|-------------|-------------|
##  later_stage |           7 |           2 |           9 |
##              |       0.004 |       0.016 |             |
##              |       0.778 |       0.222 |       0.068 |
##              |       0.066 |       0.074 |             |
##              |       0.053 |       0.015 |             |
## -------------|-------------|-------------|-------------|
## Column Total |         106 |          27 |         133 |
##              |       0.797 |       0.203 |             |
## -------------|-------------|-------------|-------------|
8.5.6  Step 6: Compare Naive Bayesian against LDA


As  mentioned  earlier,  Naive  Bayes  with  normality  assumption  is  a  special  case  of 
Discriminant  Analysis.  It  might  be  interesting  to  compare  the  results  against  LDA. 


library(MASS)
# stringsAsFactors = TRUE keeps the "No"/"Yes" indicators as factors, so as.numeric() returns
# their integer codes (with the R >= 4.0 default they would stay character strings)
df_hn_train <- data.frame(lapply(as.data.frame(hn_train, stringsAsFactors = TRUE), as.numeric), stage = hn_med_train$stage)
df_hn_test <- data.frame(lapply(as.data.frame(hn_test, stringsAsFactors = TRUE), as.numeric), stage = hn_med_test$stage)
hn_lda <- lda(stage ~ ., data = df_hn_train)
# hn_pred <- predict(hn_lda, df_hn_test[, -104])
hn_pred <- predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)


##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  133
##
##               | df_hn_test$stage
## hn_pred$class | early_stage | later_stage |   Row Total |
## --------------|-------------|-------------|-------------|
##   early_stage |          66 |          22 |          88 |
##               |       0.244 |       0.957 |             |
##               |       0.750 |       0.250 |       0.662 |
##               |       0.623 |       0.815 |             |
##               |       0.496 |       0.165 |             |
## --------------|-------------|-------------|-------------|
##   later_stage |          40 |           5 |          45 |
##               |       0.477 |       1.872 |             |
##               |       0.889 |       0.111 |       0.338 |
##               |       0.377 |       0.185 |             |
##               |       0.301 |       0.038 |             |
## --------------|-------------|-------------|-------------|
##  Column Total |         106 |          27 |         133 |
##               |       0.797 |       0.203 |             |
## --------------|-------------|-------------|-------------|

Here, Naive Bayes outperforms the LDA classifier in terms of the overall accuracy. However, LDA has a lower type II error (22 later-stage patients misclassified as early stage, compared to 25 for Naive Bayes with laplace=15), which is clinically important in order to avoid misdiagnosing later-stage cancer patients as early stage.
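The trade-off between the two classifiers can be summarized numerically from the cross tables. The sketch below re-enters the cell counts by hand (Naive Bayes with laplace=15 and LDA); the helper names are hypothetical, and the type II error is expressed here as the fraction of truly later-stage patients predicted as early stage, which is one possible convention.

```r
make_conf <- function(counts) {
  # counts are given row by row: rows = predicted class, columns = true class
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("early_stage", "later_stage"),
                         actual    = c("early_stage", "later_stage")))
}
nb_conf  <- make_conf(c(99, 25, 7, 2))     # Naive Bayes, laplace = 15
lda_conf <- make_conf(c(66, 22, 40, 5))    # LDA

summarize_conf <- function(conf) {
  c(accuracy     = sum(diag(conf)) / sum(conf),
    # later-stage patients predicted as early stage, over all later-stage patients
    missed_later = conf["early_stage", "later_stage"] / sum(conf[, "later_stage"]))
}

results <- rbind(naive_bayes = summarize_conf(nb_conf),
                 lda         = summarize_conf(lda_conf))
print(round(results, 3))
```

Under this convention Naive Bayes is more accurate overall (about 0.76 versus 0.53), while LDA misses a smaller share of the later-stage patients (about 0.81 versus 0.93).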

In  later  chapters,  we  will  step  deeper  into  the  space  of  classification  problems  and 
see  more  sophisticated  approaches. 


8.6  Practice  Problem 


In the previous case study, we classified the patients with seer_stage of "not applicable" (seer_stage=8) and "unstaged, unknown or unspecified" (seer_stage=9) as no cancer or early cancer stages. Let's remove these two categories and replicate the Naive Bayes classifier case study again.


hn_med1 <- hn_med[!hn_med$seer_stage %in% c(8, 9), ]
str(hn_med1); dim(hn_med1)

## 'data.frame':    580 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10103 1012 10135 10143 10152 10184 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47511 3138 47739 47769 47800 47938 ...
##  $ seer_stage        : Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 2 2 2 2 7 2 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 1.5 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 2 2 6 1 24 2 ...

## [1] 580   9


Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select a random 80-20 (training-testing) split. What is the prediction accuracy when laplace=1?

After splitting hn_med1 into hn_med_train1 and hn_med_test1, we can use the same code for creating the classes in the training and test datasets. Since seer_stage=8 or 9 is not in the data, we classify seer_stage=0, 1, 2 or 3 as "early_stage" and seer_stage=4, 5 or 7 as "later_stage".

hn_med_train1$stage <- hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage <- factor(hn_med_train1$stage, levels = c(FALSE, TRUE), labels = c("early_stage", "later_stage"))
hn_med_test1$stage <- hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage <- factor(hn_med_test1$stage, levels = c(FALSE, TRUE), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))

##
## early_stage later_stage
##   0.7392241   0.2607759

prop.table(table(hn_med_test1$stage))

##
## early_stage later_stage
##   0.7413793   0.2586207
Use terms that occur five or more times in the training dataset to build the document term matrix.

##    Length     Class      Mode
##       112 character character

##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  116
##
##               | hn_med_test1$stage
## hn_test_pred1 | early_stage | later_stage |   Row Total |
## --------------|-------------|-------------|-------------|
##   early_stage |          84 |          28 |         112 |
##               |       0.011 |       0.032 |             |
##               |       0.750 |       0.250 |       0.966 |
##               |       0.977 |       0.933 |             |
##               |       0.724 |       0.241 |             |
## --------------|-------------|-------------|-------------|
##   later_stage |           2 |           2 |           4 |
##               |       0.314 |       0.901 |             |
##               |       0.500 |       0.500 |       0.034 |
##               |       0.023 |       0.067 |             |
##               |       0.017 |       0.017 |             |
## --------------|-------------|-------------|-------------|
##  Column Total |          86 |          30 |         116 |
##               |       0.741 |       0.259 |             |
## --------------|-------------|-------------|-------------|

$$ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{86}{116} = 0.74.$$


Try to reproduce these results with some new data from the list of our Case-Studies.


8.7  Assignments  8:  Probabilistic  Learning:  Classification 
Using  Naive  Bayes 

8.7.1  Explain  These  Two  Concepts 

•  Bayes  Theorem 

•  Laplace  Estimation 


8.7.2  Analyzing Textual Data

Load the SOCR 2011 US Job Satisfaction data. The last column (Description) contains free text about each job. Notice that spaces are replaced by underscores, _. Mine the text field and suggest some meta-data analytics.

•  Convert  the  textual  meta-data  into  a  corpus  object. 

•  Triage  some  of  the  irrelevant  punctuation  and  other  symbols  in  the  corpus 
document,  change  all  text  to  lower  case,  etc. 

•  Tokenize  the  job  descriptions  into  words.  Examine  the  distributions  of 
Stress_Category  and  Hiring_Potential. 

•  Classify  the  job  stress  into  two  categories. 

•  Generate  a  word  cloud  to  visualize  the  job  description  text. 

•  Graphically visualize the difference between the low and high Stress_Category groups.

•  Transform the word count features into categorical data.

•  Ignore the low frequency words and report the sparsity of your categorical data matrix with and without deleting those low frequency words.

•  Apply  the  Naive  Bayes  classifier  to  original  matrix  and  lower  dimension  matrix. 
What  do  you  observe? 

•  Apply  and  compare  LDA  and  Naive  Bayes  classifiers  with  respect  to  the  error, 
specificity  and  sensitivity. 
Chapter 9

Decision Tree Divide and Conquer Classification


When the classification process needs to be transparent, the kNN and naive Bayes methods we presented earlier may not be useful, as they do not generate explicit classification rules. In some cases, we need to specify well-stated rules for our decisions, just like a scoring criterion for driving ability or credit scoring for loan underwriting. Many situations actually require a clear and easily understandable decision tree that can be followed through the classification process from start to finish.

In  this  chapter,  we  will  (1)  see  a  simple  motivational  example  of  decision  trees 
based  on  the  Iris  data;  (2)  describe  decision-tree  divide  and  conquer  methods; 
(3)  examine  certain  measures  quantifying  classification  accuracy;  (4)  show  strategies 
for  pruning  decision  trees;  (5)  work  through  a  Quality  of  Life  in  Chronic  Disease 
case-study;  and  (6)  review  the  One  Rule  and  RIPPER  algorithms. 


9.1  Motivation 

Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with a trunk (all data are part of the same cohort), which is then split into narrower and narrower branches by forking decisions based on the intrinsic data structure. At each step, splitting the data into branches may include binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the tree.

There are a number of R packages available for decision tree classification including rpart, C5.0, party, etc.
9.2 Hands-on Example: Iris Data


Let's start by seeing a simple example using the iris dataset, which we saw in Chap. 3. The data features or attributes include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and classes are represented by the Species taxa (setosa, versicolor, and virginica).


## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

##
##     setosa versicolor  virginica
##         50         50         50


The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) function will build a decision tree (Figs. 9.1 and 9.2).


library(party)  # provides ctree(); run install.packages("party") first if needed
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)

##   Conditional inference tree with 4 terminal nodes
##
## Response:  Species
## Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## Number of observations:  150
##
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
##   2)*  weights = 50
## 1) Petal.Length > 1.9
##   3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
##     4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
##       5)*  weights = 46
##     4) Petal.Length > 4.8
##       6)*  weights = 8
##   3) Petal.Width > 1.7
##     7)*  weights = 46

plot(iris_ctree, cex=2)


Fig. 9.1 Decision tree classification illustrating four leaf node labels (terminal nodes 2, 5, 6, and 7, with n = 50, 46, 8, and 46 cases, respectively) corresponding to the three iris species


Fig. 9.2 An alternative decision tree classification of the iris flowers dataset (leaf nodes: setosa 33%, versicolor 36%, and virginica 31% of the cases)
head(iris); tail(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Similarly,  we  can  demonstrate  a  classification  of  the  iris  taxa  via  rpart: 


library(rpart)

iris_rpart = rpart(Species ~ ., data=iris)
print(iris_rpart)

## n= 150
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
##   2) Petal.Length< 2.45 50   0 setosa (1.000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 100  50 versicolor (0.000 0.50000000 0.50000000)
##     6) Petal.Width< 1.75 54   5 versicolor (0.000 0.90740741 0.09259259) *
##     7) Petal.Width>=1.75 46   1 virginica (0.000 0.02173913 0.97826087) *

# Use 'rattle::fancyRpartPlot' to generate an elegant plot
library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(iris_rpart, cex = 1.5)


9.3  Decision  Tree  Overview 

The  decision  tree  algorithm  represents  an  upside  down  tree  with  lots  of  tree  branch 
bifurcations  where  a  series  of  logical  decisions  are  encoded  as  tree  node  splits.  The 
classification  begins  at  the  root  node  and  goes  through  many  branches  until  it  gets  to 
the  terminal  nodes.  This  iterative  process  splits  the  data  into  different  classes  by  rigid 
criteria. 
9.3.1 Divide and Conquer


Decision trees involve recursive partitioning that uses data features and attributes to split the data into groups (nodes) of similar classes.

To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly clustered together. Each group is considered a segment of the data. After getting the approximate range of each feature value under each group, we can make the decision tree.


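$$X = [x_1, x_2, x_3, \ldots, x_k] = \underbrace{\begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{pmatrix}}_{\text{features / attributes}}$$

Here the k columns are the features/attributes and the n rows are the cases.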


The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom-up or mixed splitting strategies) to divide and evaluate the splits of a dataset D (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into K subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:

•  All  the  samples  belong  to  the  same  class,  that  is  they  have  the  same  label  and  the 
sample  is  already  pure. 

•  Stop  when  majority  of  the  points  are  already  of  the  same  class  (relative  to  some 
error  threshold). 

•  There  are  no  remaining  attributes  on  which  the  samples  may  be  further 
partitioned. 
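
The stopping conditions listed above can be sketched as a small helper function. This is only an illustration and not the rule used by any particular package; the function name should_stop and its purity_threshold argument are hypothetical and introduced here for demonstration.

```r
# Hypothetical helper: decide whether recursive partitioning should stop at a node.
# labels:            class labels of the cases that reached the node
# n_attributes_left: number of attributes still available for splitting
# purity_threshold:  minimal majority-class proportion treated as "pure enough"
should_stop <- function(labels, n_attributes_left, purity_threshold = 1) {
  majority_share <- max(table(labels)) / length(labels)
  majority_share >= purity_threshold || n_attributes_left == 0
}

should_stop(c("a", "a", "a"), n_attributes_left = 3)            # pure node
should_stop(c("a", "a", "b"), n_attributes_left = 3)            # mixed node
should_stop(c("a", "a", "a", "b"), 3, purity_threshold = 0.75)  # majority rule
should_stop(c("a", "b"), n_attributes_left = 0)                 # no attributes left
```

With the default threshold of 1, only perfectly pure nodes stop early; lowering the threshold corresponds to the second stopping condition (a sufficiently large majority class).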

One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. There are three main indices to evaluate the impurity reduction: misclassification error, Gini index, and entropy.

For a given table containing pairs of attributes and their class labels, we can assess the homogeneity of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.
9.3.2 Entropy

The Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set $D = (X_1, X_2, \ldots, X_n)$ that includes n features (variables) and suppose each of these features can take on any of k possible values (states). Then the cardinality of the entire system is $k^n$, as each of the features is assumed to have k independent states; thus the total number of different datasets that can be expected is $k \times k \times \cdots \times k = k^n$. Suppose $p_1, p_2, \ldots, p_k$ represent the proportions of each class (note: $\sum_i p_i = 1$) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by:

$$Entropy(D) = -\sum_{i} p_i \log_2 p_i.$$

If each of the $1 \le i \le k$ states for each feature is equally likely to be observed with probability $p_i = \frac{1}{k}$, then the entropy is maximized. Using base-k logarithms, which normalize the maximum to 1:

$$Entropy(D) = -\sum_{i=1}^{k} \frac{1}{k} \log_k \frac{1}{k} = \sum_{i=1}^{k} \frac{1}{k} \log_k k = \sum_{i=1}^{k} \frac{1}{k} = 1.$$

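In the other extreme, the entropy is minimized for a single class classification where the probability of one class is unitary ($p_{i_o} = 1$) and the other ones are trivial ($p_{i \neq i_o} = 0$). Note that by L'Hopital's Rule $\lim_{x \to 0} x \log(x) = \lim_{x \to 0} \frac{\log(x)}{1/x} = \lim_{x \to 0} \frac{1/x}{-1/x^2} = \lim_{x \to 0} (-x) = 0$. Therefore:

$$Entropy(D) = -\sum_{i} p_i \log(p_i) = -\left( p_{i_o} \log(p_{i_o}) + \sum_{i \neq i_o} p_i \log(p_i) \right) = -\left( 1 \times \log(1) + \lim_{x \to 0} \sum_{i \neq i_o} x \log(x) \right) = 0 + 0 = 0.$$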

In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample $D = \{x_1, x_2, \ldots, x_n\}$ is defined by:

$$H(D) = -\sum_{i=1}^{k} P(c_i | D) \log P(c_i | D),$$

where $P(c_i | D)$ is the probability of a data point in D being labeled with class $c_i$, and k is the number of classes (clusters). $P(c_i | D)$ can be estimated from the observed data by:

$$P(c_i | D) = \frac{|\{x_j \in D \mid x_j \text{ has label } y_j = c_i\}|}{|D|}.$$


Observe that if the observations are evenly split amongst all k classes, then $P(c_i | D) = \frac{1}{k}$ and

$$H(D) = -\sum_{i=1}^{k} \frac{1}{k} \log \frac{1}{k} = \log k,$$

which equals 1 when base-k logarithms are used.

At the other extreme, if all the observations are from one class then:

$$H(D) = -1 \times \log_k(1) = 0.$$

Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy, since $\log_b(x) = \frac{\log_2(x)}{\log_2(b)}$.
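
As a quick numerical check of these two extremes, the entropy of a label vector can be computed directly from the estimated class proportions. The following is a minimal sketch; the helper name entropy and its base argument are our own illustrative choices and do not come from any package.

```r
# Shannon entropy of a vector of class labels; 'base' sets the logarithm base.
entropy <- function(labels, base = 2) {
  p <- as.numeric(table(labels)) / length(labels)  # estimated P(c_i | D)
  p <- p[p > 0]                                    # treat 0 * log(0) as 0
  -sum(p * log(p, base = base))
}

entropy(iris$Species)            # three equally frequent classes: log2(3)
entropy(iris$Species, base = 3)  # the same sample with base k = 3 is normalized to 1
entropy(rep("setosa", 10))       # a pure sample has entropy 0
```

Changing the base only rescales the result, which is why the evenly split iris labels give log2(3) in bits but exactly 1 with base 3.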

The  Gain  is  the  expected  reduction  in  entropy  caused  by  knowing  the  value  of  an 
attribute. 


9.3.3  Misclassification  Error  and  Gini  Index 

Similar  to  the  Entropy  measure,  the  Misclassification  error  and  the 
Gini  index  are  also  applied  to  evaluate  information  gain.  The  Misclassification 
error  is  defined  by  the  formula: 


$$ME = 1 - \max_k (p_k).$$

And the Gini index is expressed as:

$$GI = \sum_k p_k (1 - p_k).$$
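
Both measures are simple functions of the class proportions in a node. The sketch below evaluates them for a hypothetical node with an 80/20 class split; the function names and the toy labels are illustrative assumptions.

```r
# Impurity of a node computed from its class labels.
misclassification_error <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - max(p)
}
gini_index <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  sum(p * (1 - p))
}

node <- c(rep("minor", 8), rep("severe", 2))  # hypothetical 80/20 node
misclassification_error(node)  # 1 - 0.8 = 0.2
gini_index(node)               # 0.8 * 0.2 + 0.2 * 0.8 = 0.32
```

Like the entropy, both indices are 0 for a pure node and largest when the classes are evenly mixed.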


9.3.4  C5.0  Decision  Tree  Algorithm 

The C5.0 algorithm is a popular implementation of decision trees.

To begin with, let's consider the term purity. If a segment of data contains a single class, it is considered pure. The entropy represents a mathematical formalism measuring the purity of data segments.
$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i),$$

where entropy is the measurement, c is the total number of class levels, and $p_i$ refers to the proportion of observations that fall into each class (i.e., the probability that a randomly selected data point belongs to the ith class level). For two possible classes, the entropy ranges from 0 to 1. For n classes, the entropy ranges from 0 to $\log_2(n)$, where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what the benefit of using the entropy is. The smaller the entropy of the segments produced by a split, the more information that split provides about the class labels. Systems (data) with high entropy indicate significant information content (randomness), and data with low entropy indicate highly compressible data with structure in it.

If we only have one class in the segment, then $Entropy(S) = (-1) \times \log_2(1) = 0$.

Let's try another example. If we have a segment of data that contains two classes, the first class contains 80% of the data and the second class contains the remaining 20%. Then, we have the following entropy:

$$Entropy(S) = -0.8 \log_2(0.8) - 0.2 \log_2(0.2) = 0.7219281.$$
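
The same arithmetic can be verified in R in two lines; the variable names below are only illustrative.

```r
p <- c(0.8, 0.2)                     # class proportions in the segment
segment_entropy <- -sum(p * log2(p))
segment_entropy                      # 0.7219281
```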

The relationship between the class proportions and the entropy for two classes is illustrated in Fig. 9.3, where x is the proportion of elements in one of the classes.


Fig. 9.3 Plot of the entropy of a (symmetric) binary process as a function of the proportion of class 1 cases
set.seed(1234)
x <- runif(100)

curve(-x*log2(x) - (1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)

The  closer  the  binary  proportion  split  is  to  0.5,  the  greater  the  entropy.  The  more 
homogeneous  the  split  (one  class  becomes  the  majority)  the  lower  the  entropy. 
Decision  trees  aim  to  find  splits  in  the  data  that  reduce  the  entropy,  i.e.,  increasing 
the  homogeneity  of  the  elements  within  all  classes. 

This measuring mechanism could be used to measure and compare the information we get using different features as data partitioning characteristics. Let's consider this scenario. Suppose $S$ and $S_1$ represent the system before and after the splitting/partitioning of the data according to a specific data feature attribute (F). Denote the entropies of the original and the derived partition by $Entropy(S)$ and $Entropy(S_1)$, respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:

$$Gain(F) = Entropy(S) - Entropy(S_1).$$

Note that smaller entropy $Entropy(S_1)$ corresponds with better classification and more information gained. A more complicated case would be that the partitions create multiple segments. Then, the entropy for each partition method is calculated by the following formula:

$$Entropy(S) = \sum_{i=1}^{n} w_i Entropy(P_i) = \sum_{i=1}^{n} w_i \left( -\sum_{j=1}^{c} p_j \log_2(p_j) \right),$$

where $w_i$ is the proportion of examples falling in segment $i$, $P_i$ is segment $i$, and $p_j$ is the proportion of class $j$ within that segment. Thus, the total entropy of a partition method is calculated by a weighted sum of entropies for each segment created by this method.

When we get the maximum reduction in entropy with a feature (F), then $Gain(F) = Entropy(S)$, since $Entropy(S_1) = 0$. On the contrary, if we gain no information with this feature, we have $Gain(F) = 0$.
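
To make the weighted-entropy calculation concrete, here is a minimal sketch that computes the gain of a binary split on the iris data. The helper names entropy and information_gain, and the two candidate splits, are our own illustrative choices.

```r
# Entropy (base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain = Entropy(S) - sum_i w_i * Entropy(P_i) for a logical splitting condition.
information_gain <- function(labels, condition) {
  segments <- split(labels, condition)             # one label vector per segment
  w <- sapply(segments, length) / length(labels)   # segment weights w_i
  entropy(labels) - sum(w * sapply(segments, entropy))
}

gain_petal <- information_gain(iris$Species, iris$Petal.Length < 2.45)
gain_sepal <- information_gain(iris$Species, iris$Sepal.Width < 3)
gain_petal  # log2(3) - 2/3, about 0.918
gain_sepal  # a noticeably weaker split
```

The petal-length split isolates all 50 setosa flowers in a pure segment (entropy 0) and leaves a 50/50 segment (entropy 1) with weight 2/3, so the weighted entropy after the split is 2/3 and the gain is log2(3) - 2/3.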


9.3.5  Pruning  the  Decision  Tree 

While making a decision tree, we can classify the observations using as many splits as we want. This eventually might overfit our data. An extreme example would be a tree that makes each observation its own class, which is meaningless.

So how do we control the size of the decision tree? One possible solution is to set a cutoff for the number of decisions that a decision tree can make. Similarly, we can require that the number of examples in each segment is not too small. This method is called early stopping, or pre-pruning, the decision tree. However, this might make the decision procedure stop prematurely, before some important partition occurs.
Another solution, post-pruning, begins by growing a big decision tree and subsequently reduces the branches based on error rates with penalty at the nodes. This is often more effective than the pre-pruning solution.

The C5.0 algorithm uses the post-pruning method to control the size of the decision tree. It first grows a large, overfitting tree that contains all the possibilities of partitioning. Then, it cuts out nodes and branches with little effect on classification errors.
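
The grow-then-prune idea can be tried out with the rpart package used earlier in this chapter. Note that this sketch swaps in rpart's cost-complexity pruning for illustration; it is not the pruning procedure that C5.0 applies internally, and the cp value of 0.05 is an arbitrary choice.

```r
library(rpart)

# Grow a deliberately large tree: no complexity penalty, very small nodes allowed.
big_tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, xval = 0))

# Post-prune: collapse splits that do not improve the fit by at least cp.
pruned_tree <- prune(big_tree, cp = 0.05)

nrow(big_tree$frame)     # number of nodes before pruning
nrow(pruned_tree$frame)  # number of nodes after pruning (never larger)
```

Pre-pruning would instead be expressed through the control arguments themselves, e.g., a larger minsplit or a smaller maxdepth in rpart.control().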


9.4 Case Study 1: Quality of Life and Chronic Disease

9.4.1 Step 1: Collecting Data

In this chapter, we are using the Quality of life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv. This dataset has 41 variables. Detailed description for each variable is provided here (https://umich.instructure.com/files/399150/download?download_frd=1).

Important  variables: 

•  Charlson  Comorbidity  Index:  ranging  from  0  to  10.  A  score  of  0  indicates  no 
comorbid  conditions.  Higher  scores  indicate  a  greater  level  of  comorbidity. 

•  Chronic Disease Score: A summary score based on the presence and complexity of prescription medications for select chronic conditions. A high score indicates that the patient has severe chronic diseases. Entries stored as -9 indicate missing values.


9.4.2  Step  2:  Exploring  and  Preparing  the  Data 

Let’s  load  the  data  first. 


qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)


## 'data.frame':    2356 obs. of  41 variables:
##  $ ID                 : int  171 171 172 179 180 180 181 182 183 186 ...
##  $ INTERVIEWDATE      : int  0 427 0 0 0 42 0 0 0 0 ...
##  $ LANGUAGE           : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ AGE                : int  49 49 62 44 64 64 52 48 49 78 ...
##  $ RACE_ETHNICITY     : int  3 3 3 7 3 3 3 3 3 4 ...
##  $ SEX                : int  2 2 2 2 1 1 2 1 1 1 ...
##  $ QOL_Q_01           : int  4 4 3 6 3 3 4 2 3 5 ...
##  $ QOL_Q_02           : int  4 3 3 6 2 5 4 1 4 6 ...
##  $ QOL_Q_03           : int  4 4 4 6 3 6 4 3 4 4 ...
##  $ QOL_Q_04           : int  4 4 2 6 3 6 2 2 5 2 ...
##  $ QOL_Q_05           : int  1 5 4 6 2 6 4 3 4 3 ...
##  $ QOL_Q_06           : int  4 4 2 6 1 2 4 1 2 4 ...
##  $ QOL_Q_07           : int  1 2 5 -1 0 5 8 4 3 7 ...
##  $ QOL_Q_08           : int  6 1 3 6 6 6 3 1 2 4 ...
##  $ QOL_Q_09           : int  3 4 3 6 2 2 4 2 2 4 ...
##  $ QOL_Q_10           : int  3 1 3 6 3 6 3 2 4 3 ...
##  $ MSA_Q_01           : int  1 3 2 6 2 3 4 1 1 2 ...
##  $ MSA_Q_02           : int  1 1 2 6 1 6 4 3 2 4 ...
##  $ MSA_Q_03           : int  2 1 2 6 1 2 3 3 1 2 ...
##  $ MSA_Q_04           : int  1 3 2 6 1 2 1 4 1 5 ...
##  $ MSA_Q_05           : int  1 1 1 6 1 2 1 6 3 2 ...
##  $ MSA_Q_06           : int  1 2 2 6 1 2 1 1 2 2 ...
##  $ MSA_Q_07           : int  2 1 3 6 1 1 1 1 1 5 ...
##  $ MSA_Q_08           : int  1 1 1 6 1 1 1 1 2 1 ...
##  $ MSA_Q_09           : int  1 1 1 6 2 2 4 6 2 1 ...
##  $ MSA_Q_10           : int  1 1 1 6 1 1 1 1 1 3 ...
##  $ MSA_Q_11           : int  2 3 2 6 1 1 2 1 1 5 ...
##  $ MSA_Q_12           : int  1 1 2 6 1 1 2 6 1 3 ...
##  $ MSA_Q_13           : int  1 1 1 6 1 6 2 1 4 2 ...
##  $ MSA_Q_14           : int  1 1 1 6 1 2 1 1 3 1 ...
##  $ MSA_Q_15           : int  2 1 1 6 1 1 3 2 1 3 ...
##  $ MSA_Q_16           : int  2 3 5 6 1 2 1 2 1 2 ...
##  $ MSA_Q_17           : int  2 1 1 6 1 1 1 1 1 3 ...
##  $ PH2_Q_01           : int  3 2 1 5 1 1 3 1 2 3 ...
##  $ PH2_Q_02           : int  4 4 1 5 1 2 1 1 4 2 ...
##  $ TOS_Q_01           : int  2 2 2 4 1 1 2 2 1 1 ...
##  $ TOS_Q_02           : int  1 1 1 4 4 4 1 2 4 4 ...
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : int  2 2 3 1 0 0 2 8 0 1 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...


Most of the coded variables like QOL_Q_01 (health rating) have ordinal values (1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor, 6 = no answer). We can use the table() function to see their distributions. We also have some numerical variables in the dataset like CHRONICDISEASESCORE. We can take a look at it by using summary().

Our variable of interest CHRONICDISEASESCORE has some missing data. A simple way to address this is just deleting those observations with missing values. You could also try to impute the missing values using various imputation methods mentioned in Chap. 3.


table(qol$QOL_Q_01)

##
##   1   2   3   4   5   6
##  44 213 801 900 263 135

qol <- qol[!qol$CHRONICDISEASESCORE == -9, ]
summary(qol$CHRONICDISEASESCORE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.880   1.395   1.497   1.970   4.760
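
A numeric missing-value code such as -9 can be handled either by dropping the affected rows, as in the code above, or by recoding it to NA so that R's built-in missing-data tools apply. The sketch below contrasts the two options on a small made-up data frame; the toy values are hypothetical and are not taken from the qol data.

```r
# Toy data frame (hypothetical values) using -9 as the missing-value code.
toy <- data.frame(ID = 1:5, CHRONICDISEASESCORE = c(1.60, -9, 2.97, 0.88, -9))

# Option 1: drop the rows carrying the missing-value code.
toy_complete <- toy[toy$CHRONICDISEASESCORE != -9, ]

# Option 2: recode -9 as NA and let functions skip the missing entries.
toy_na <- toy
toy_na$CHRONICDISEASESCORE[toy_na$CHRONICDISEASESCORE == -9] <- NA

nrow(toy_complete)                              # 3 rows remain
mean(toy_na$CHRONICDISEASESCORE, na.rm = TRUE)  # mean of the 3 observed scores
```

Recoding to NA keeps the remaining columns of those cases available, which is the natural starting point for the imputation methods mentioned earlier.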
Let's create two classes using the variable CHRONICDISEASESCORE. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE) as having minor disease and the rest as having severe disease. This dichotomous classification (qol$cd) may not be perfect, and we will talk about alternative classification strategies in the practice problem at the end of the chapter.

qol$cd <- qol$CHRONICDISEASESCORE > 1.497

qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
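
The cutoff 1.497 above is the mean reported by summary(). A variant that computes the threshold from the data, so that it stays correct if the data change, might look like the following sketch; the scores are hypothetical values used only for illustration.

```r
scores <- c(0.88, 1.40, 1.97, 2.51, 0.50)  # hypothetical chronic disease scores
cutoff <- mean(scores)                      # computed rather than typed in

cd <- factor(scores > cutoff, levels = c(FALSE, TRUE),
             labels = c("minor_disease", "severe_disease"))
table(cd)
```

Here the mean is 1.452, so the two scores above it are labeled severe_disease and the remaining three are labeled minor_disease.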


Data Preparation: Creating Random Training and Test Datasets

To make the qol data more organized, we can order the data by the variable ID.

qol <- qol[order(qol$ID), ]

# Remove ID (col=1); the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]

Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training (2114) and testing (100) sets:

qol_train <- qol[1:2114, ]
qol_test <- qol[2115:2214, ]

And here is an example of random assignments of cases into training and testing sets (80-20% split):


set.seed(1234)

train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test <- qol[-train_index, ]

We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equally in the training and testing datasets.

prop.table(table(qol_train$cd))

##  minor_disease severe_disease
##      0.5279503      0.4720497

prop.table(table(qol_test$cd))

##  minor_disease severe_disease
##       0.503386       0.496614
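
A simple random split only balances the classes approximately. If exactly matching class proportions are desired, the sampling can be done within each class (a stratified split). The following base-R sketch uses hypothetical labels; it is one possible approach and not the procedure used in this case study.

```r
set.seed(1234)
# Hypothetical labels: 60 minor and 40 severe cases.
cd <- factor(rep(c("minor_disease", "severe_disease"), times = c(60, 40)))

# Sample 80% of the row indices separately within each class.
train_index <- unlist(lapply(split(seq_along(cd), cd), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

prop.table(table(cd[train_index]))   # 0.6 / 0.4, the same as in the full data
prop.table(table(cd[-train_index]))  # 0.6 / 0.4 in the held-out 20%
```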
9.4.3 Step 3: Training a Model On the Data

In this section, we are using the C5.0() function from the C50 package. The function C5.0() has the following components:

m <- C5.0(train, class, trials=1, costs=NULL)


•  train: data frame containing numeric training data (features).

•  class: factor vector with the class for each row in the training data.

•  trials: an optional number to control the boosting iterations (default = 1).

•  costs: an optional matrix to specify the costs of false positives and false negatives.

You could delete the # in the following code and run it in R to install and load the C50 package.


#  install. packages("C50") 
library (C50) 

In the qol dataset (ID column is already removed), column 41 is the class vector (qol$cd), and column 40 is the numerical version of vector 41 (qol$CHRONICDISEASESCORE). We need to delete these two columns to create our training data that only contains features.


summary(qol_train[, -c(40, 41)])

##  INTERVIEWDATE       LANGUAGE          AGE        RACE_ETHNICITY
##  Min.   :  0.00   Min.   :1.000   Min.   :20.00   Min.   :1.000
##  1st Qu.:  0.00   1st Qu.:1.000   1st Qu.:52.00   1st Qu.:3.000
##  Median :  0.00   Median :1.000   Median :59.00   Median :3.000
##  Mean   : 21.68   Mean   :1.217   Mean   :58.74   Mean   :3.614
##  3rd Qu.:  0.00   3rd Qu.:1.000   3rd Qu.:67.00   3rd Qu.:4.000
##  Max.   :440.00   Max.   :2.000   Max.   :90.00   Max.   :7.000
##       SEX           QOL_Q_01        QOL_Q_02        QOL_Q_03
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000
##  Median :1.000   Median :4.000   Median :3.000   Median :4.000
##  Mean   :1.422   Mean   :3.661   Mean   :3.408   Mean   :3.714
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :2.000   Max.   :6.000   Max.   :6.000   Max.   :6.000
##     TOS_Q_03        TOS_Q_04     CHARLSONSCORE
##  Min.   :1.000   Min.   :1.000   Min.   :-9.0000
##  1st Qu.:4.000   1st Qu.:5.000   1st Qu.: 0.0000
##  Median :4.000   Median :5.000   Median : 1.0000
##  Mean   :3.787   Mean   :4.686   Mean   : 0.8826
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.: 1.0000
##  Max.   :5.000   Max.   :6.000   Max.   :10.0000

set.seed(1234)

qol_model <- C5.0(qol_train[, -c(40, 41)], qol_train$cd)
qol_model
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Tree size: 25
##
## Non-standard options: attempt to group attributes

summary(qol_model)

##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
##
## C5.0 [Release 2.07 GPL Edition]      Tue Jun 20 16:09:16 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1771 cases (40 attributes) from undefined.data
##
## Decision tree:
##
## CHARLSONSCORE <= 0: minor_disease (665/180)
## CHARLSONSCORE > 0:
## :...AGE <= 47:
##     :...MSA_Q_08 > 2: severe_disease (15/4)
##     :   MSA_Q_08 <= 2:
##     :   :...MSA_Q_14 <= 1: minor_disease (86/20)
##     :       MSA_Q_14 > 1:
##     :       :...MSA_Q_10 > 4: minor_disease (6)
##     :           MSA_Q_10 <= 4:
##     :           :...TOS_Q_03 > 4: severe_disease (8)
##     :               TOS_Q_03 <= 4:
##     :               :...MSA_Q_17 > 2: minor_disease (8/1)
##     :                   MSA_Q_17 <= 2:
##     :                   :...QOL_Q_01 <= 2: minor_disease (4)
##     :                       QOL_Q_01 > 2: severe_disease (38/13)
##     AGE > 47:
##     :...RACE_ETHNICITY > 3:
##         :...QOL_Q_07 > 5: severe_disease (133/26)
##         :   QOL_Q_07 <= 5:
##         :   :...QOL_Q_10 > 5: severe_disease (24/2)
##         :       QOL_Q_10 <= 5:
##         :       :...MSA_Q_14 <= 5: severe_disease (202/72)
##         :           MSA_Q_14 > 5: minor_disease (11/2)
##         RACE_ETHNICITY <= 3:
##         :...QOL_Q_01 <= 2: minor_disease (50/20)
##             QOL_Q_01 > 2:
##             :...CHARLSONSCORE > 1: severe_disease (184/58)
##                 CHARLSONSCORE <= 1:
##                 :...MSA_Q_04 > 5: minor_disease (27/8)
##                     MSA_Q_04 <= 5:
##                     :...QOL_Q_07 <= 5:
##                         :...QOL_Q_05 <= 2:
##                         :   :...TOS_Q_04 <= 2: minor_disease (5)
##                         :   :   TOS_Q_04 > 2: severe_disease (52/15)
##                         :   QOL_Q_05 > 2:
##                         :   :...MSA_Q_06 <= 5: minor_disease (119/46)
##                         :       MSA_Q_06 > 5: severe_disease (10/2)
##                         QOL_Q_07 > 5:
##                         :...QOL_Q_09 <= 2: severe_disease (18/1)
##                             QOL_Q_09 > 2:
##                             :...RACE_ETHNICITY <= 2: minor_disease (12/5)
##                                 RACE_ETHNICITY > 2:
##                                 :...MSA_Q_17 > 3: severe_disease (19/2)
##                                     MSA_Q_17 <= 3:
##                                     :...PH2_Q_01 <= 3: severe_disease (50/14)
##                                         PH2_Q_01 > 3:
##                                         :...MSA_Q_14 <= 3: minor_disease (21/6)
##                                             MSA_Q_14 > 3: severe_disease (4)
##
##
## Evaluation on training data (1771 cases):
##
##      Decision Tree
##    ----------------
##    Size      Errors
##
##      25  497(28.1%)   <<
##
##
##     (a)   (b)    <-classified as
##    ----  ----
##     726   209    (a): class minor_disease
##     288   548    (b): class severe_disease
##
##
##  Attribute usage:
##
##  100.00% CHARLSONSCORE
##   62.45% AGE
##   53.13% RACE_ETHNICITY
##   38.40% QOL_Q_07
##   34.61% QOL_Q_01
##   21.91% MSA_Q_14
##   19.03% MSA_Q_04
##   13.38% QOL_Q_10
##   10.50% QOL_Q_05
##    9.32% MSA_Q_08
##    8.13% MSA_Q_17
##    7.28% MSA_Q_06
##    7.00% QOL_Q_09
##    4.23% PH2_Q_01
##    3.61% MSA_Q_10
##    3.27% TOS_Q_03
##    3.22% TOS_Q_04

The output of qol_model indicates that we have a tree that has 25 terminal nodes. The summary(qol_model) output suggests that the classification error of the decision tree is 28% in the training data.
9.4.4 Step 4: Evaluating Model Performance

Now we can make predictions using the decision tree that we just built. The
predict() function we will use is the same as the one we showed in earlier
chapters, e.g., Chaps. 3 and 8. In general, predict() is extended by each specific
type of regression, classification, clustering, or forecasting machine learning
technique. For example, randomForest::predict.randomForest() is invoked by:

predict(RF_model, newdata, type="response", norm.votes=TRUE,
        predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),

where type represents the type of prediction output to be generated: "response"
(equivalent to "class"), "prob", or "votes". Thus, the predicted values are either
predicted "response" class labels, a matrix of class probabilities, or vote counts.

This time we are going to introduce the confusionMatrix() function in the
caret package as the evaluation method. When we combine it with the table()
function, the output of the evaluation is very straightforward.


# remove the last 2 columns, CHRONICDISEASESCORE and cd, which represent the
# clinical outcomes we are predicting
qol_pred <- predict(qol_model, qol_test[ , -c(40, 41)])

# install.packages("caret")
library(caret)

confusionMatrix(table(qol_pred, qol_test$cd))


## Confusion Matrix and Statistics
##
## qol_pred         minor_disease severe_disease
##   minor_disease            149             89
##   severe_disease            74            131
##
##                Accuracy : 0.6321
##                  95% CI : (0.5853, 0.6771)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 3.317e-08
##
##                   Kappa : 0.2637
##  Mcnemar's Test P-Value : 0.2728
##
##             Sensitivity : 0.6682
##             Specificity : 0.5955
##          Pos Pred Value : 0.6261
##          Neg Pred Value : 0.6390
##              Prevalence : 0.5034
##          Detection Rate : 0.3363
##    Detection Prevalence : 0.5372
##       Balanced Accuracy : 0.6318
##
##        'Positive' Class : minor_disease
The confusion matrix shows that the prediction accuracy on the testing data is about
63%. However, this estimate may vary (see the corresponding 95% confidence interval,
0.585 to 0.677).
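
The headline statistics can be recomputed by hand from the four cell counts, which helps when reading caret output. The following is a minimal base-R sketch, assuming the 2x2 table reported above (rows are predicted labels, columns are true labels); the variable names are illustrative and not part of the caret API.

```r
# Cell counts copied from the confusion matrix above
# (rows = predicted label, columns = true label)
cm <- matrix(c(149, 74, 89, 131), nrow = 2,
             dimnames = list(predicted = c("minor_disease", "severe_disease"),
                             actual    = c("minor_disease", "severe_disease")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm[1, 1] / sum(cm[, 1])   # 'positive' class is minor_disease
specificity <- cm[2, 2] / sum(cm[, 2])

# Cohen's kappa: agreement beyond what the row/column margins alone would give
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa_stat), 4)
```

The rounded values should agree with the Accuracy, Sensitivity, Specificity, and Kappa rows printed by confusionMatrix().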


9.4.5  Step  5:  Trial  Option 

The C5.0 function includes an option, trials=, which is an integer specifying
the number of boosting iterations. The default value of one indicates that a single
model is used, and we can specify a larger number of iterations, for instance
trials=6.

set.seed(1234)
# try alternative values for the trials option
qol_boost6 <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6)
qol_boost6

## 

##  Call: 

##  C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)

## 

##  Classification  Tree 
##  Number  of  samples:  1771 
##  Number  of  predictors :  39 
## 

##  Number  of  boosting  iterations :  6 
##  Average  tree  size:  11.7 
## 

##  Non-standard  options:  attempt  to  group  attributes 

We can see that the average tree size is reduced to about 12, compared to 25 for the
single tree (this may vary at each run).

Since this is a fairly small tree, we can visualize it using the function plot(). We
also use the option type="simple" to make the tree look more condensed
(Fig. 9.4).

plot(qol_boost6, type="simple")

Fig. 9.4 Classification tree plot of the quality of life (QoL) data


Caution: The plotting of decision trees will fail if you have columns that start with
numbers or special characters (e.g., "5variable", "[variable"). In general, avoid
spaces, special characters, and other non-terminal symbols in column/row names.

The  next  step  would  be  making  predictions  and  testing  the  corresponding 
accuracy. 


qol_boost_pred6 <- predict(qol_boost6, qol_test[ , -c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))


## Confusion Matrix and Statistics
##
## qol_boost_pred6  minor_disease severe_disease
##   minor_disease            140             75
##   severe_disease            83            145
##
##                Accuracy : 0.6433
##                  95% CI : (0.5968, 0.688)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 1.987e-09
##
##                   Kappa : 0.2868
##  Mcnemar's Test P-Value : 0.5776
##
##             Sensitivity : 0.6278
##             Specificity : 0.6591
##          Pos Pred Value : 0.6512
##          Neg Pred Value : 0.6360
##              Prevalence : 0.5034
##          Detection Rate : 0.3160
##    Detection Prevalence : 0.4853
##       Balanced Accuracy : 0.6434
##
##        'Positive' Class : minor_disease


The accuracy is about 64%, only slightly higher than the 63% of the single tree, and
the two 95% confidence intervals overlap substantially. This result may also vary each
time we run the experiment (mind the confidence interval). In some studies, the trials
option provides significant improvement to the overall accuracy. A good choice for this
option is trials = 10.


9.4.6  Loading  the  Misclassification  Error  Matrix 

Suppose we want to reduce the false negative rate, in this case, misclassifying a
severe case as minor. A false negative (failure to detect a severe disease case) may be
more costly than a false positive (misclassifying a minor disease case as severe).
Misclassification costs can be expressed as a matrix:

error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
error_cost

##      [,1] [,2]
## [1,]    0    4
## [2,]    1    0


Let's build a decision tree with the option costs=error_cost.


set.seed(1234)
qol_cost <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred <- predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))


## Confusion Matrix and Statistics
##
## qol_cost_pred    minor_disease severe_disease
##   minor_disease             60             17
##   severe_disease           163            203
##
##                Accuracy : 0.5937
##                  95% CI : (0.5463, 0.6398)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 8.352e-05
##
##                   Kappa : 0.1909
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.2691
##             Specificity : 0.9227
##          Pos Pred Value : 0.7792
##          Neg Pred Value : 0.5546
##              Prevalence : 0.5034
##          Detection Rate : 0.1354
##    Detection Prevalence : 0.1738
##       Balanced Accuracy : 0.5959
##
##        'Positive' Class : minor_disease


Although the overall accuracy decreased (to about 59%), the number of false negatives
(severe disease cases labeled as minor) was reduced from 75 (boosted model, without
specifying a cost matrix) to 17 (when specifying a non-trivial (loaded) cost matrix).
This comes at the cost of increasing the number of false positives (minor disease cases
misclassified as severe) from 83 to 163.
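
To see why the cost-sensitive tree is preferable under this loss, we can weight each confusion-matrix cell by its cost. The sketch below is base R only and assumes that both the cost matrix and the confusion tables are oriented with predicted labels in rows and true labels in columns (as in table(prediction, truth) above); total_cost() is an illustrative helper, not a C50 or caret function.

```r
# Cost matrix from the text: a severe case predicted as minor costs 4,
# a minor case predicted as severe costs 1 (assumed: rows = predicted, cols = actual)
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)

# Cell counts copied from the two test-set confusion matrices above
cm_boost <- matrix(c(140, 83, 75, 145), nrow = 2)  # trials = 6, no cost matrix
cm_cost  <- matrix(c(60, 163, 17, 203), nrow = 2)  # costs = error_cost

# Total misclassification cost = sum over cells of (count * cost)
total_cost <- function(cm, cost) sum(cm * cost)

total_cost(cm_boost, error_cost)  # 75*4 + 83*1
total_cost(cm_cost,  error_cost)  # 17*4 + 163*1
```

Under these assumptions the cost-sensitive model has a lower total cost even though its raw accuracy is lower, which is the trade-off the cost matrix is designed to make.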


9.4.7  Parameter  Tuning 


There are multiple ways to plot trees fitted by rpart or C5.0.

Fig. 9.5 Decision tree classification of the QoL data


library("rpart")

# remove CHRONICDISEASESCORE, but keep the *cd* label
set.seed(1234)
qol_model <- rpart(cd ~ ., data=qol_train[ , -40], cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model

## n= 1771
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 1771 836 minor_disease (0.5279503 0.4720497)
##   2) CHARLSONSCORE< 0.5 665 180 minor_disease (0.7293233 0.2706767) *
##   3) CHARLSONSCORE>=0.5 1106 450 severe_disease (0.4068716 0.5931284)
##     6) AGE< 47.5 165 65 minor_disease (0.6060606 0.3939394) *
##     7) AGE>=47.5 941 350 severe_disease (0.3719447 0.6280553) *

You can also plot directly using rpart.plot (Fig. 9.5).

library(rpart.plot)
rpart.plot(qol_model, type = 4, extra = 1, clip.right.labs = F)

We can also use fancyRpartPlot (Fig. 9.6).

library("rattle")
fancyRpartPlot(qol_model, cex = 1)

Fig. 9.6 Another decision tree classification of the QoL data, compare to Fig. 9.5


qol_pred <- predict(qol_model, qol_test, type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))

## Confusion Matrix and Statistics
##
## qol_pred         minor_disease severe_disease
##   minor_disease            133             64
##   severe_disease            90            156
##
##                Accuracy : 0.6524
##                  95% CI : (0.606, 0.6967)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 1.759e-10
##
##                   Kappa : 0.3053
##  Mcnemar's Test P-Value : 0.04395
##
##             Sensitivity : 0.5964
##             Specificity : 0.7091
##          Pos Pred Value : 0.6751
##          Neg Pred Value : 0.6341
##              Prevalence : 0.5034
##          Detection Rate : 0.3002
##    Detection Prevalence : 0.4447
##       Balanced Accuracy : 0.6528
##
##        'Positive' Class : minor_disease


These results are consistent with their counterparts reported using C5.0. How
can we tune the parameters to further improve the results? (Fig. 9.7).


set.seed(1234)
# cp = 0 grows the full tree; xval = number of cross-validation folds
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
qol_model = rpart(cd ~ ., data = qol_train[ , -40], control = control)
plotcp(qol_model)
Fig. 9.7 Tuning the decision tree classification by reducing the error across the spectrum of
cost-complexity pruning parameter (cp) and tree size (top axis: size of tree)


printcp(qol_model)

## Classification tree:
## rpart(formula = cd ~ ., data = qol_train[, -40], control = control)
##
## Variables actually used in tree construction:
##  [1] AGE            CHARLSONSCORE  INTERVIEWDATE  LANGUAGE
##  [5] MSA_Q_01       MSA_Q_02       MSA_Q_03       MSA_Q_04
##  [9] MSA_Q_05       MSA_Q_06       MSA_Q_07       MSA_Q_08
## [13] MSA_Q_09       MSA_Q_10       MSA_Q_11       MSA_Q_12
## [17] MSA_Q_13       MSA_Q_14       MSA_Q_15       MSA_Q_16
## [21] MSA_Q_17       PH2_Q_01       PH2_Q_02       QOL_Q_01
## [25] QOL_Q_02       QOL_Q_03       QOL_Q_04       QOL_Q_05
## [29] QOL_Q_06       QOL_Q_07       QOL_Q_08       QOL_Q_09
## [33] QOL_Q_10       RACE_ETHNICITY SEX            TOS_Q_01
## [37] TOS_Q_02       TOS_Q_03       TOS_Q_04
##
## Root node error: 836/1771 = 0.47205
##
## n= 1771
##
##            CP nsplit rel error  xerror     xstd
## 1  0.24641148      0 1.0000000 1.00000 0.025130
## 2  0.04186603      1 0.7535885 0.75359 0.024099
## 3  0.00717703      2 0.7117225 0.71651 0.023816
## 4  0.00657895      3 0.7045455 0.72967 0.023920
## 5  0.00598086      9 0.6543062 0.74282 0.024020
## 6  0.00478469     14 0.6244019 0.74282 0.024020
## 7  0.00418660     17 0.6100478 0.75239 0.024090
## 8  0.00398724     21 0.5933014 0.75359 0.024099
## 9  0.00358852     32 0.5466507 0.75957 0.024141
## 10 0.00318979     41 0.5143541 0.77033 0.024215
## 11 0.00299043     53 0.4665072 0.78110 0.024286
## 12 0.00239234     59 0.4485646 0.78469 0.024309
## 13 0.00209330     91 0.3708134 0.80024 0.024406
## 14 0.00199362     95 0.3624402 0.82057 0.024522
## 15 0.00191388    108 0.3349282 0.83014 0.024574
## 16 0.00179426    122 0.2978469 0.82416 0.024542
## 17 0.00159490    151 0.2416268 0.82656 0.024555
## 18 0.00153794    164 0.2177033 0.82895 0.024567
## 19 0.00149522    171 0.2069378 0.83134 0.024580
## 20 0.00119617    182 0.1866029 0.83134 0.024580
## 21 0.00089713    295 0.0514354 0.86842 0.024758
## 22 0.00079745    306 0.0406699 0.87440 0.024783
## 23 0.00071770    309 0.0382775 0.87321 0.024778
## 24 0.00068353    314 0.0346890 0.87321 0.024778
## 25 0.00059809    321 0.0299043 0.88876 0.024841
## 26 0.00039872    367 0.0023923 0.88995 0.024846
## 27 0.00000000    373 0.0000000 0.89474 0.024864

Now we can prune the tree according to the optimal cp, the complexity parameter value
to which the rpart object will be trimmed. Instead of using the real error (e.g., 1 - R^2,
RMSE) to capture the discrepancy between the observed labels and the model-predicted
labels, we will use the xerror, which averages the discrepancy between observed and
predicted classifications using cross-validation, see Chap. 21. In the cp table, the
xerror is smallest (0.71651) at cp = 0.00717703, which corresponds to a tree with only
two splits. Figs. 9.8, 9.9, and 9.10 show some alternative decision tree pruning results.

set.seed(1234)
selected_tr <- prune(qol_model,
                     cp = qol_model$cptable[which.min(qol_model$cptable[ , "xerror"]), "CP"])
fancyRpartPlot(selected_tr, cex = 1)


Fig. 9.8 Pruned decision tree classification for the QoL data, compare to Figs. 9.5 and 9.6
(the root node splits on CHARLSONSCORE < 0.5, followed by a split on AGE < 48)
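
The cp selection step can be checked without refitting the tree. The sketch below is base R only; it rebuilds the first five rows of the cp table printed by printcp() as a plain matrix (a stand-in for qol_model$cptable, which is assumed to have the same column names) and applies the same which.min() rule.

```r
# First five rows of the cp table reported by printcp(qol_model)
cptable <- matrix(c(0.24641148, 0, 1.0000000, 1.00000, 0.025130,
                    0.04186603, 1, 0.7535885, 0.75359, 0.024099,
                    0.00717703, 2, 0.7117225, 0.71651, 0.023816,
                    0.00657895, 3, 0.7045455, 0.72967, 0.023920,
                    0.00598086, 9, 0.6543062, 0.74282, 0.024020),
                  ncol = 5, byrow = TRUE,
                  dimnames = list(NULL,
                                  c("CP", "nsplit", "rel error", "xerror", "xstd")))

# Row with the smallest cross-validated error, and the cp value to prune at
best_row <- which.min(cptable[, "xerror"])
best_cp  <- cptable[best_row, "CP"]

cptable[best_row, ]
```

Note that the training error (rel error) keeps shrinking as the tree grows, whereas xerror starts rising after two splits; pruning at best_cp therefore yields the small tree shown in Fig. 9.8.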
qol_pred_tune <- predict(selected_tr, qol_test, type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))


## Confusion Matrix and Statistics
##
## qol_pred_tune    minor_disease severe_disease
##   minor_disease            133             64
##   severe_disease            90            156
##
##                Accuracy : 0.6524
##                  95% CI : (0.606, 0.6967)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 1.759e-10
##
##                   Kappa : 0.3053
##  Mcnemar's Test P-Value : 0.04395
##
##             Sensitivity : 0.5964
##             Specificity : 0.7091
##          Pos Pred Value : 0.6751
##          Neg Pred Value : 0.6341
##              Prevalence : 0.5034
##          Detection Rate : 0.3002
##    Detection Prevalence : 0.4447
##       Balanced Accuracy : 0.6528
##
##        'Positive' Class : minor_disease
The result is roughly the same as that of C5.0. Although there is no substantial
classification improvement, the tree-pruning process generates a graphical
representation of the decision-making protocol (selected_tr) that is much
simpler and more intuitive than the original (un-pruned) tree (qol_model):

fancyRpartPlot(qol_model, cex = 0.1)

Fig. 9.9 Testing data (QoL dataset) decision tree prediction results (for chronic disease, CD)

Fig. 9.10 Training data QoL decision tree plot


9.5  Compare  Different  Impurity  Indices 

We can change split = "entropy" to "error" or "gini" to apply an alternative
information gain index. Experiment with these settings and compare the results.

set.seed(1234)
qol_model = rpart(cd ~ ., data=qol_train[ , -40],
                  parms = list(split = "entropy"))
fancyRpartPlot(qol_model, cex = 1)

# Modify and test using "error" and "gini"
# qol_pred <- predict(qol_model, qol_test, type = 'class')
# confusionMatrix(table(qol_pred, qol_test$cd))
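
To build intuition for how the three impurity indices differ, each one can be evaluated on a vector of class proportions. This is a minimal base-R sketch using the standard textbook definitions; the function names gini(), entropy(), and misclass() are illustrative and are not rpart functions. The example proportions are the root-node class shares reported for the training data.

```r
# Impurity of a node, given its vector of class proportions p (summing to 1)
gini     <- function(p) 1 - sum(p^2)
entropy  <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # in bits; 0*log(0) := 0
misclass <- function(p) 1 - max(p)

# Root-node class proportions of the QoL training data (minor, severe)
p_root <- c(0.5279503, 0.4720497)

sapply(list(gini = gini, entropy = entropy, error = misclass),
       function(f) f(p_root))

# A pure node has zero impurity under all three indices
sapply(list(gini = gini, entropy = entropy, error = misclass),
       function(f) f(c(1, 0)))
```

All three indices are maximal for a 50/50 node and zero for a pure node; they differ in how sharply they reward nearly pure splits, which is why the resulting trees can differ.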


9.6  Classification  Rules 

In addition to the classification trees we just saw, we can explore classification rules
that utilize if-else logical statements to assign classes to unlabeled data. Below
we review three classification rule strategies.


9.6.1  Separate  and  Conquer 

Separate and conquer repeatedly splits the data (and subsets of the data) by rules that
cover a subset of examples. This procedure is very similar to the divide and conquer
approach. However, a notable difference is that each rule can be independent,
whereas each decision node in a tree has to be linked to past decisions.

9.6.2  The  One  Rule  Algorithm 

To understand the One Rule (OneR) algorithm, we need to know about its
"sibling", the ZeroR rule. The ZeroR rule means that we assign the mode class to
all unlabeled test observations regardless of their feature values. The One Rule algorithm
is an improved version of ZeroR that uses a single rule for classification. In other
words, OneR splits the training dataset into several segments based on the values of one
feature. Then, it assigns the mode of the classes within each segment to the related
observations in the unlabeled test data. In practice, we first test multiple rules (one
per feature) and pick the rule with the smallest error rate to be our One Rule. Remember,
these rules may be subjective.
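
The ZeroR and OneR ideas described above can be sketched in a few lines of base R. This is an illustrative toy implementation for categorical features only, not the RWeka implementation; the data frame toy and the helpers mode_class() and one_rule() are hypothetical names introduced for this example.

```r
# Hypothetical toy data: two categorical features and a class label
toy <- data.frame(
  fever = c("yes", "yes", "no", "no", "yes", "no", "yes", "no"),
  cough = c("yes", "no", "yes", "no", "yes", "no", "no", "yes"),
  label = c("sick", "sick", "well", "well", "sick", "well", "sick", "sick"),
  stringsAsFactors = FALSE)

# Most frequent class (ties are broken alphabetically by which.max)
mode_class <- function(y) names(which.max(table(y)))

# ZeroR: always predict the overall mode class
zero_r <- mode_class(toy$label)

# OneR: for each feature, map every feature value to the mode class of its
# segment, count training errors, and keep the feature with the fewest errors
one_rule <- function(data, target) {
  features <- setdiff(names(data), target)
  rules <- lapply(features, function(f)
    tapply(data[[target]], data[[f]], mode_class))
  names(rules) <- features
  errors <- sapply(features, function(f)
    sum(rules[[f]][data[[f]]] != data[[target]]))
  best <- names(which.min(errors))
  list(feature = best, rule = rules[[best]], errors = errors)
}

fit <- one_rule(toy, "label")
fit$feature   # feature chosen for the single rule
fit$rule      # value -> class mapping
fit$errors    # training errors per candidate feature
```

Numeric features, such as CHARLSONSCORE in the case study below, would first need to be discretized into intervals; that step is omitted in this sketch.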


9.6.3  The  RIPPER  Algorithm 


The Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm is a
combination of the ideas behind decision trees and classification rules. It consists of
a three-step process:

•  Grow:  add  conditions  to  a  rule  until  it  cannot  split  the  data  into  more  segments. 

•  Prune:  delete  some  of  the  conditions  that  have  large  error  rates. 

•  Optimize:  repeat  the  above  two  steps  until  we  cannot  add  or  delete  any  of  the 
conditions. 


9.7  Case  Study  2:  QoL  in  Chronic  Disease  (Take  2) 

Let’s  take  another  look  at  the  same  dataset  as  Case  Study  1  -  this  time  applying 
classification  rules.  Naturally,  we  will  skip  over  the  first  two  data  handling  steps  and 
go  directly  to  step  three. 


9.7.1  Step  3:  Training  a  Model  on  the  Data 

Let's start by using the OneR() function in the RWeka package. Before installing
the package, you might want to check that the Java installation on your computer is up to
date. Also, its version has to match the version of R (i.e., 64-bit R needs 64-bit Java).
The function OneR() has the following invocation protocol:


m <- OneR(class ~ predictors, data=mydata)


•  class: factor vector with the class for each row in mydata.

•  predictors: feature variables in mydata. If we want to include x1, x2 as predictors
and y as the class label variable, we use y ~ x1 + x2. To specify a full model, we use
the notation y ~ ., which includes all of the column variables as predictors.

•  mydata: the dataset where the features and labels can be found.

# install.packages("RWeka")
library(RWeka)

# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R <- OneR(cd ~ ., data=qol[ , -40])
qol_1R


## CHARLSONSCORE:
##  < -4.5  -> severe_disease
##  < 0.5   -> minor_disease
##  < 5.5   -> severe_disease
##  < 8.5   -> minor_disease
##  >= 8.5  -> severe_disease
## (1453/2214 instances correct)


Note  that  1,453  out  of  2,214  cases  are  correctly  classified,  66%,  by  the  “one  rule”. 
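
The printed rule is an ordered list of intervals on a single feature, so it can be applied to new scores with a lookup. The following base-R sketch assumes the cut-points and labels shown in the OneR output above; apply_one_rule() is an illustrative helper, not part of RWeka.

```r
# Cut-points and interval labels copied from the OneR output above
breaks <- c(-4.5, 0.5, 5.5, 8.5)
labels <- c("severe_disease",  # score <  -4.5
            "minor_disease",   # -4.5 <= score < 0.5
            "severe_disease",  #  0.5 <= score < 5.5
            "minor_disease",   #  5.5 <= score < 8.5
            "severe_disease")  # score >= 8.5

# findInterval() returns how many break points are <= score (0 to 4)
apply_one_rule <- function(score) labels[findInterval(score, breaks) + 1]

apply_one_rule(c(0, 3, 7, 9))
```

Reading the rule this way makes it clear that the classifier looks only at CHARLSONSCORE, and that the alternating labels for large scores are likely driven by a small number of cases.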


9.7.2 Step 4: Evaluating Model Performance


summary(qol_1R)

##
## === Summary ===
##
## Correctly Classified Instances        1453               65.6278 %
## Incorrectly Classified Instances       761               34.3722 %
## Kappa statistic                          0.3206
## Mean absolute error                      0.3437
## Root mean squared error                  0.5863
## Relative absolute error                 68.8904 %
## Root relative squared error            117.3802 %
## Total Number of Instances             2214
##
## === Confusion Matrix ===
##
##    a   b   <-- classified as
##  609 549 |   a = minor_disease
##  212 844 |   b = severe_disease

We obtained a single rule that correctly classifies 66% of the patients, which is in line
with the prior decision tree classification results. Due to algorithmic stochasticity,
it's normal that these results may vary each time you run the algorithm, even though we
used set.seed(1234) to ensure some result reproducibility.

9.7.3 Step 5: Alternative Model 1

Another possible option for the classification rules would be the RIPPER rule
algorithm that we discussed earlier in the chapter. In R we use the Java-based
function JRip() to invoke this algorithm.

The JRip() function has the same components as the OneR() function:

m <- JRip(class ~ predictors, data=mydata)

set.seed(1234)
qol_jrip1 <- JRip(cd ~ ., data=qol[ , -40])
qol_jrip1

## JRIP rules:
## ===========
##
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
##  => cd=minor_disease (1121.0/360.0)
##
## Number of Rules : 3
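
The three rules form an ordered decision list: the first rule whose conditions hold determines the label, and the last rule is the default. A minimal base-R sketch of this logic follows, using the thresholds printed above; jrip_predict() and the patients data frame are hypothetical and only illustrate how the rule set is read.

```r
# Apply the printed JRip rule list to a data frame with the three columns used
jrip_predict <- function(d) {
  rule1 <- d$CHARLSONSCORE >= 1 & d$RACE_ETHNICITY >= 4 & d$AGE >= 49
  rule2 <- d$CHARLSONSCORE >= 1 & d$AGE >= 53
  # Both rules predict the same class, so their order does not change the label;
  # anything matched by neither rule falls through to the default class
  ifelse(rule1 | rule2, "severe_disease", "minor_disease")
}

# Hypothetical patients (not taken from the QoL dataset)
patients <- data.frame(CHARLSONSCORE  = c(0, 2, 2, 1),
                       RACE_ETHNICITY = c(4, 4, 2, 2),
                       AGE            = c(70, 50, 50, 60))

jrip_predict(patients)
```

The pair of numbers after each rule in the JRip output (e.g., 448.0/132.0) reports the cases covered by the rule and, of those, the cases it misclassifies.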

summary(qol_jrip1)

## Correctly Classified Instances        1457               65.8085 %
## Incorrectly Classified Instances       757               34.1915 %
## Kappa statistic                          0.3158
## Mean absolute error                      0.4459
## Root mean squared error                  0.4722
## Relative absolute error                 89.3711 %
## Root relative squared error             94.5364 %
## Total Number of Instances             2214
##
## === Confusion Matrix ===
##
##    a   b   <-- classified as
##  761 397 |   a = minor_disease
##  360 696 |   b = severe_disease

This JRip() classifier uses only three rules and has a similar accuracy, 66%.
As each individual has unique characteristics, classification in real-world data
is rarely perfect (close to 100% accuracy).

9.7.4 Step 5: Alternative Model 2

Another idea is to repeat the generation of trees multiple times, predict according to
each tree's performance, and finally ensemble those weighted votes into a combined
classification result. This is precisely the idea behind random forest classification,
see Chap. 15 (Figs. 9.11 and 9.12).


require(randomForest)
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd ~ ., data=qol_train[ , -40], importance=TRUE,
                       ntree=2000, mtry=26)
varImpPlot(rf.fit); print(rf.fit)

Fig. 9.11 Variable importance plots of random forest classification of the QoL CD variable using
accuracy (left) and Gini index (right) as evaluation metrics

Fig. 9.12 Error plots of the random forest prediction of CD (QoL chronic disease) using three
different trees models (plot title: MSE Errors vs. Iterations: Three Models rf.fit, rf.fit1, rf.fit2)
## Call:
##  randomForest(formula = cd ~ ., data = qol_train[, -40],
##               importance = TRUE, ntree = 2000, mtry = 26)
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 26
##
##         OOB estimate of  error rate: 35.86%
## Confusion matrix:
##                minor_disease severe_disease class.error
## minor_disease            576            359   0.3839572
## severe_disease           276            560   0.3301435

rf.fit1 <- randomForest(cd ~ ., data=qol_train[ , -40], importance=TRUE,
                        ntree=2000, mtry=26)
rf.fit2 <- randomForest(cd ~ ., data=qol_train[ , -40], importance=TRUE,
                        nodesize=5, ntree=5000, mtry=26)

plot(rf.fit, log="x", main="rf.fit (Black), rf.fit1 (Red), rf.fit2 (Green)")
# for classification forests, the out-of-bag error per number of trees is stored
# in err.rate; the x-range must match each model's ntree
points(1:2000, rf.fit1$err.rate[ , "OOB"], col="red", type="l")
points(1:5000, rf.fit2$err.rate[ , "OOB"], col="green", type="l")

qol_pred <- predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))


## Confusion Matrix and Statistics
##
## qol_pred         minor_disease severe_disease
##   minor_disease            138             69
##   severe_disease            85            151
##
##                Accuracy : 0.6524
##                  95% CI : (0.606, 0.6967)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 1.759e-10
##
##                   Kappa : 0.305
##  Mcnemar's Test P-Value : 0.2268
##
##             Sensitivity : 0.6188
##             Specificity : 0.6864
##          Pos Pred Value : 0.6667
##          Neg Pred Value : 0.6398
##              Prevalence : 0.5034
##          Detection Rate : 0.3115
##    Detection Prevalence : 0.4673
##       Balanced Accuracy : 0.6526
##
##        'Positive' Class : minor_disease


These variable importance plots (varImpPlot) show the rank order of importance
of the features according to the specific index (accuracy, left, and Gini, right). More
information about random forests is available in Chap. 15: Improving Model
Performance.


In random forest (RF) classification, the node size (nodesize) refers to the
smallest node that can be split, i.e., nodes with fewer cases than the nodesize are
never subdivided. Increasing the node size leads to smaller trees, which may
compromise predictive power. On the flip side, increasing the tree size
(maxnodes) and the number of trees (ntree) tends to increase the predictive
accuracy. However, there are tradeoffs between increasing node size and tree size
simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and
more trees. Ensembling (forest) results from a larger number of trees will likely
generate better results.
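
The ensembling step itself is simple: each tree casts a vote for every case and the forest reports the most frequent label. The base-R sketch below illustrates plain majority voting on a hypothetical vote matrix; it is a simplification and does not reproduce the internals of the randomForest package (e.g., bootstrap sampling, mtry, or class cutoffs).

```r
# Hypothetical votes: one row per test case, one column per tree
votes <- matrix(c("minor_disease",  "severe_disease", "severe_disease",
                  "minor_disease",  "minor_disease",  "severe_disease",
                  "severe_disease", "severe_disease", "severe_disease"),
                nrow = 3, byrow = TRUE)

# Majority vote per case (ties are broken alphabetically by which.max)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

# Fraction of trees voting for severe_disease, analogous to type = "prob"
severe_share <- rowMeans(votes == "severe_disease")

majority
severe_share
```

With only three trees the vote shares are coarse (multiples of 1/3); adding trees makes these shares, and hence the majority label, more stable.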


9.8  Practice  Problem 

In the previous case study, we classified the CHRONICDISEASESCORE into two
groups. What will happen if we use three groups? Let's separate
CHRONICDISEASESCORE into three groups. Recall the quantile()
function that we talked about in Chap. 3. We can use it to get the cut-points for
classification. Then, a for loop will help us split the variable
CHRONICDISEASESCORE into three categories.

quantile(qol$CHRONICDISEASESCORE, probs = c(1/3, 2/3))

##  33.33333%  66.66667%
##       1.06       1.80

for(i in 1:2214){
  if(qol$CHRONICDISEASESCORE[i] > 0.7 & qol$CHRONICDISEASESCORE[i] < 2.2){
    qol$cdthree[i] = 2
  }
  else if(qol$CHRONICDISEASESCORE[i] >= 2.2){
    qol$cdthree[i] = 3
  }
  else{
    qol$cdthree[i] = 1
  }
}

qol$cdthree <- factor(qol$cdthree, levels=c(1, 2, 3), labels =
    c("minor_disease", "mild_disease", "severe_disease"))
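
The same three-way labeling can be written without an explicit loop. The sketch below is a vectorized base-R alternative that uses the same cut-offs as the loop above (0.7 and 2.2); the score vector is hypothetical and stands in for qol$CHRONICDISEASESCORE.

```r
# Hypothetical scores standing in for qol$CHRONICDISEASESCORE
score <- c(0.2, 0.7, 1.06, 1.8, 2.19, 2.2, 3.5)

# Same logic as the loop: > 0.7 and < 2.2 is mild (2), >= 2.2 is severe (3),
# everything else is minor (1)
code <- ifelse(score >= 2.2, 3, ifelse(score > 0.7, 2, 1))

cdthree <- factor(code, levels = c(1, 2, 3),
                  labels = c("minor_disease", "mild_disease", "severe_disease"))

table(cdthree)
```

Unlike the loop, this version returns NA for missing scores instead of stopping with an error, and it does not hard-code the number of rows.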

After labeling the three categories in the new variable cdthree, our job of
preparing the class variable is done. Let's follow along the earlier sections in the
chapter to determine how well the tree classifiers and the rule classifiers perform in
the three-category case. First, try to build a tree classifier using C5.0() with
10 boosting trials. One small tip is that the training predictors must exclude columns
40 (CHRONICDISEASESCORE), 41 (cd), and now 42 (cdthree), because they all
contain class outcome related variables.
# qol_train1 <- qol[1:2114, ]
# qol_test1 <- qol[2115:2214, ]

train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1 <- qol[train_index, ]
qol_test1 <- qol[-train_index, ]

prop.table(table(qol_train1$cdthree))

##
##  minor_disease   mild_disease severe_disease
##      0.1699605      0.6459627      0.1840768

prop.table(table(qol_test1$cdthree))

##
##  minor_disease   mild_disease severe_disease
##      0.1760722      0.6478555      0.1760722

set.seed(1234)
qol_model1 <- C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree,
                   trials=10)
qol_model1

##
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y =
##  qol_train1$cdthree, trials = 10)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 10
## Average tree size: 230.5
##
## Non-standard options: attempt to group attributes

qol_pred1 <- predict(qol_model1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))

## Confusion Matrix and Statistics
##
##                 qol_pred1
##                  minor_disease mild_disease severe_disease
##   minor_disease             12           58              8
##   mild_disease              23          239             25
##   severe_disease             3           61             14
##
## Overall Statistics
##
##                Accuracy : 0.5982
##                  95% CI : (0.5509, 0.6442)
##     No Information Rate : 0.8081
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0.0923
##  Mcnemar's Test P-Value : 4.174e-07
##
## Statistics by Class:
##
##                      Class: minor_disease Class: mild_disease Class: severe_disease
## Sensitivity                       0.31579              0.6676                0.2979
## Specificity                       0.83704              0.4353                0.8384
## Pos Pred Value                    0.15385              0.8328                0.1795
## Neg Pred Value                    0.92877              0.2372                0.9096
## Prevalence                        0.08578              0.8081                0.1061
## Detection Rate                    0.02709              0.5395                0.0316
## Detection Prevalence              0.17607              0.6479                0.1761
## Balanced Accuracy                 0.57641              0.5514                0.5681


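We can see that the prediction accuracy with three categories (about 60%) is lower than
the accuracy we obtained with two categories (about 64% with boosting), and the kappa
statistic (0.09) indicates little agreement beyond chance.

Next, try to build a rule classifier with OneR().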


set.seed(1234)
qol_1R1 <- OneR(cdthree ~ ., data=qol[ , -c(40, 41)])
qol_1R1


## INTERVIEWDATE:
##  < 3.5    -> mild_disease
##  < 28.5   -> severe_disease
##  < 282.0  -> mild_disease
##  < 311.5  -> severe_disease
##  >= 311.5 -> mild_disease
## (1436/2214 instances correct)

summary(qol_1R1)

##
## === Summary ===
##
## Correctly Classified Instances        1436               64.86 %
## Incorrectly Classified Instances       778               35.14 %
## Kappa statistic                          0.022
## Mean absolute error                      0.2343
## Root mean squared error                  0.484
## Relative absolute error                 67.5977 %
## Root relative squared error            116.2958 %
## Total Number of Instances             2214
##
## === Confusion Matrix ===
##
##     a    b    c   <-- classified as
##     0  375    4 |    a = minor_disease
##     0 1422    9 |    b = mild_disease
##     0  390   14 |    c = severe_disease

qol_pred1 <- predict(qol_1R1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))

##  Confusion  Matrix  and  Statistics 
## 
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## 

## 

## 

## 

## 

## 

##  Overall 
## 


minor_disease 
miLd_disease 
severe  disease 


qoL_predl 

minor_disease  miLd_disease  severe_ 
0  78 

0  285 

0  76 


disease 

0 

2 

2 


Statistics 


## 

## 

## 

## 

## 

## 

## 

## 

##  Statistics 
## 


Accuracy 
95%  Cl 

No  Information  Rate 
P -Value  [Acc  >  NIR] 


Mcnemar ' s 


Kappa 
Test  P-Value 

by  Class: 


0.6479 

(0. 6014 j  0.6923) 

0.991 

1 

0.012 

NA 


## 

Class:  minor_disease  Class: 

mild_disease 

##  Sensitivity 

NA 

0.64920 

##  Specificity 

0.8239 

0. 50000 

##  Pos  Pred  Value 

NA 

0.99303 

##  Nea  Pred  Value 

NA 

0.01282 

##  Prevalence 

0. 0000 

0.99097 

##  Detection  Rate 

0. 0000 

0.64334 

##  Detection  Prevalence 

0.1761 

0.64786 

##  Balanced  Accuracy 

NA 

0.57460 

## 


Class:  severe  disease 


##  Sensitivity 
##  Specificity 
##  Pos  Pred  Value 
##  Neg  Pred  Value 
##  Prevalence 
##  Detection  Rate 
##  Detection  Prevalence 
##  Balanced  Accuracy 


0 . 500000 
0.826879 
0.025641 
0.994521 
0.009029 
0.004515 
0.176072 
0.663440 


The OneR classifier, which is based purely on the value of INTERVIEWDATE, has 65%
internal classification accuracy and also 65% external (validation data) prediction
accuracy. However, the latter assessment is somewhat misleading, as the vast majority
of the external validation data are classified into only one class, mild_disease.
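To see why the 65% figure is misleading, the test-set confusion matrix reported above can be re-entered by hand and summarized with base R. This is only an illustrative sketch: the counts are copied from the printed output, and the object names (cm, accuracy, predicted_share, baseline) are ours, not part of the original analysis.

```r
# Test-set confusion matrix for the OneR model, copied from the printed output
# (rows = true labels, columns = predicted labels).
classes <- c("minor_disease", "mild_disease", "severe_disease")
cm <- matrix(c(0,  78, 0,
               0, 285, 2,
               0,  76, 2),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Overall accuracy: correctly classified cases lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Share of test cases that the classifier assigns to each class
predicted_share <- colSums(cm) / sum(cm)

# Accuracy of a trivial rule that always predicts the most common true class
baseline <- max(rowSums(cm)) / sum(cm)

round(accuracy, 4)
round(predicted_share, 4)
round(baseline, 4)
```

With these counts, about 99% of the test cases are labeled mild_disease, and the accuracy coincides with that of the trivial rule that always predicts the most frequent class, so the single rule adds essentially no discriminating power on the validation data.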

Finally, let's revisit the JRip() classifier with the same three class labels
according to cdthree.
set.seed(1234)
qol_jrip1<-JRip(cdthree~., data=qol[, -c(40, 41)])
qol_jrip1

## JRIP rules:
## ===========
##
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
##  => cdthree=mild_disease (2018.0/653.0)
##
## Number of Rules : 5

summary(qol_jrip1)

##
## === Summary ===
##
## Correctly Classified Instances        1481               66.8925 %
## Incorrectly Classified Instances       733               33.1075 %
## Kappa statistic                          0.1616
## Mean absolute error                      0.3288
## Root mean squared error                  0.4055
## Relative absolute error                 94.8702 %
## Root relative squared error             97.42   %
## Total Number of Instances             2214
##
## === Confusion Matrix ===
##
##     a    b    c   <-- classified as
##    24  342   13 |    a = minor_disease
##    10 1365   56 |    b = mild_disease
##     1  311   92 |    c = severe_disease

qol_pred1<-predict(qol_jrip1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))

## Confusion Matrix and Statistics
##
##                 qol_pred1
##                  minor_disease mild_disease severe_disease
##   minor_disease              5           70              3
##   mild_disease               2          275             10
##   severe_disease             0           61             17
##
## Overall Statistics
##
##                Accuracy : 0.6704
##                  95% CI : (0.6245, 0.7141)
##     No Information Rate : 0.9165
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0.1583
##  Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
##                      Class: minor_disease Class: mild_disease
## Sensitivity                       0.71429              0.6773
## Specificity                       0.83257              0.6757
## Pos Pred Value                    0.06410              0.9582
## Neg Pred Value                    0.99452              0.1603
## Prevalence                        0.01580              0.9165
## Detection Rate                    0.01129              0.6208
## Detection Prevalence              0.17607              0.6479
## Balanced Accuracy                 0.77343              0.6765
##                      Class: severe_disease
## Sensitivity                        0.56667
## Specificity                        0.85230
## Pos Pred Value                     0.21795
## Neg Pred Value                     0.96438
## Prevalence                         0.06772
## Detection Rate                     0.03837
## Detection Prevalence               0.17607
## Balanced Accuracy                  0.70948


In terms of the predictive accuracy on the testing data (qol_test1$cdthree),
we can see from these outputs that the RIPPER algorithm performed better (67%)
than the C5.0 decision tree (60%) and similarly to the OneR algorithm (65%),
which suggests that simple algorithms might outperform complex methods for
certain real-world case-studies. Later, in Chap. 15, we will provide more details
about optimizing and improving classification and prediction performance.

Try to replicate these results with other data from the list of our Case-Studies.


9.9  Assignments  9:  Decision  Tree  Divide  and  Conquer 
Classification 

9.9.1  Explain  These  Concepts 

•  Information  Gain  Measure 

•  Impurity 

•  Entropy 

•  Gini 


9.9.2  Decision  Tree  Partitioning 

Use  the  SOCR  Neonatal  Pain  data  to  build  and  display  a  decision  tree  recursively 
partitioning  the  data  using  the  provided  features  and  attributes  to  split  the  data  into 
similar  classes. 

•  Collect  and  preprocess  the  data,  e.g.,  data  conversion  and  variable  selection. 

•  Randomly  split  the  data  into  training  and  testing  sets. 

•  Train  decision  tree  models  on  the  data  using  C5.0  and  rpart. 
•  Evaluate and compare the two models.

•  Tune the rpart parameters and repeat the evaluation and comparison.

•  Assess the prediction accuracy and report the confusion matrix.

•  Comment on different aspects of the prediction performance.

•  Use various impurity measures and re-estimate the models.

•  Try to use the RWeka package to train decision models and compare the results.

•  Try to apply Random Forest and obtain a variable importance plot.
Chapter 10

Forecasting  Numeric  Data  Using  Regression 
Models 




In Chaps. 7, 8, and 9, we covered classification methods that use mathematical
formalism to address everyday life prediction problems. In this Chapter, we will
focus on specific model-based statistical methods providing forecasting and
classification functionality. Specifically, we will (1) demonstrate the predictive
power of multiple linear regression; (2) show the foundation of regression trees and
model trees; and (3) examine two complementary case-studies (Baseball Players and
Heart Attack).

It  may  be  helpful  to  first  review  Chap.  5  (Linear  Algebra/Matrix  Manipulations) 
and  Chap.  7  (Introduction  to  Machine  Learning). 


10.1  Understanding  Regression 

Regression quantifies the relationship between a dependent variable (the value to
be predicted) and a group of independent variables (predictors, similar to the features
discussed in Chap. 7). We assume that the relationship between the dependent variable
and the independent variables follows a predefined model, e.g., an affine or hyper-linear
model.


10.1.1  Simple  Linear  Regression 

First recall the material presented in Chap. 5 (Linear Algebra & Matrix Computing).
The simplest case of regression modeling involves a single predictor:

$$y = a + bx.$$


©  Ivo  D.  Dinov  2018 
Fig. 10.1 Scatterplot and a linear model of length of stay (LOS) vs. hospital charges for the heart
attack data

This formula should appear familiar by now. In this slope-intercept analytical
expression, a is the intercept and b is the slope. This is the equation form of the
simple linear regression model. If we know a and b, then for any given x (input) we can
predict y (output) via the above formula. If we plot x and y in a 2D coordinate
system, the model is graphically represented as a straight line.

However, this is the ideal case. When we plot real-world data, the pattern may
be harder to recognize. Let's look at the scatter plot (see Chap. 3) and simple linear
regression line of two variables: hospital charges, or CHARGES (dependent variable),
and length of stay in the hospital, or LOS (predictor). The data is available online,
CaseStudy12_AdultsHeartAttack_Data. We removed two observations
that have missing data using the command heart_attack<-heart_attack
[complete.cases(heart_attack), ].

heart_attack<-read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1",
    stringsAsFactors = F)
heart_attack$CHARGES<-as.numeric(heart_attack$CHARGES)

## Warning: NAs introduced by coercion

heart_attack<-heart_attack[complete.cases(heart_attack), ]

fit1<-lm(CHARGES~LOS, data=heart_attack)
par(cex=.8)
plot(heart_attack$LOS, heart_attack$CHARGES, xlab="LOS", ylab="CHARGES")
abline(fit1, lwd=2)

It  seems  to  be  common  sense  that  the  longer  you  stay  in  the  hospital,  the  higher 
the  medical  costs  will  be.  However,  on  the  scatter  plot,  we  have  only  a  bunch  of  dots 
showing  some  sign  of  an  increasing  pattern  (Fig.  10.1). 
The estimated expression for this regression line is:

$$y = 4582.70 + 212.29 \times x,$$

or equivalently

$$CHARGES = 4582.70 + 212.29 \times LOS.$$

It is simple to make predictions with this regression line. Assume we have a patient
who spent 10 days in the hospital, so LOS = 10. The predicted charge is
$4582.70 + $212.29 × 10 = $6705.60. Plugging x into the regression equation
directly gives us an estimated value of the outcome y. The Correlation and Regression
chapter of the Probability and Statistics EBook provides an introduction to linear modeling
(http://wiki.socr.umich.edu/index.php/EBook#Chapter_X:_Correlation_and_Regression).
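The prediction step described above can be sketched in a few lines of R. The coefficients are the rounded values quoted in the text; the helper name predict_charges() is hypothetical and introduced only for illustration.

```r
# Rounded coefficients reported in the text for the model CHARGES ~ LOS
intercept <- 4582.70   # estimated charge at LOS = 0
slope     <- 212.29    # estimated additional charge per extra hospital day

# Hypothetical helper: predicted charges for one or more lengths of stay
predict_charges <- function(los) {
  intercept + slope * los
}

predict_charges(10)            # the 10-day example from the text
predict_charges(c(1, 5, 20))   # the helper is vectorized over several stays
```

When a fitted lm object is available, the same prediction would normally be obtained with predict() and a newdata data frame rather than by typing the coefficients by hand.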


10.2  Ordinary  Least  Squares  Estimation 

How  did  we  get  the  estimated  expression?  The  most  common  estimating  method  in 
statistics  is  ordinary  least  squares  (OLS).  OLS  estimators  are  obtained  by  minimizing 
the  sum  of  the  squared  errors  -  that  is  the  sum  of  squared  vertical  distances  between 
each  point  on  the  scatter  plot  and  its  predicted  value  on  the  regression  line  (Fig.  10.2). 


Fig.  10.2  Graphical  representation  of  the  residuals  representing  the  difference  between  observed 
and  predicted  values 
OLS minimizes the following expression:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b \times x_i)\right)^2 = \sum_{i=1}^{n} e_i^2.$$


Some simple mathematical operations to minimize the sum of squared errors yield the
following solution for the slope parameter b:

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

while the intercept a is given by:

$$a = \bar{y} - b\bar{x}.$$

Recall what we learned in Chap. 3, where the variance was obtained by averaging
the sum of squares, $var(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$. When we use
$\bar{x}$ to estimate the mean of x, we have the following formula for the sample variance:

$$var(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

We can see that this is $\frac{1}{n-1}$ times the denominator of b. Similar to the
variance, the covariance of x and y measures the average sum of the x deviance times
the y deviance:

$$Cov(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y).$$

If we use the sample averages $(\bar{x}, \bar{y})$, we have:

$$Cov(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

This is $\frac{1}{n-1}$ times the numerator of b.

Combining the above, we get an estimate of the slope coefficient (effect-size of
LOS on Charge):

$$b = \frac{Cov(x, y)}{var(x)}.$$

Let's examine these closed-form analytical expressions using the heart
attack data.

b<-cov(heart_attack$LOS, heart_attack$CHARGES)/var(heart_attack$LOS); b
## [1] 212.2869

a<-mean(heart_attack$CHARGES)-b*mean(heart_attack$LOS); a
## [1] 4582.7

We can see that these estimates match the previously reported values.
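The agreement between the closed-form expressions and lm() is not specific to the heart attack data. The following sketch repeats the check on simulated data, so that it runs without downloading anything; the variable names and the simulation parameters (sample size, true intercept and slope, noise level) are arbitrary assumptions chosen only for illustration.

```r
set.seed(1234)
n <- 200

# Simulated stand-ins for length of stay (days) and hospital charges
los     <- rpois(n, lambda = 6) + 1
charges <- 4500 + 200 * los + rnorm(n, sd = 1500)

# Closed-form OLS estimates: slope = Cov(x, y) / var(x), intercept = mean(y) - b * mean(x)
b <- cov(los, charges) / var(los)
a <- mean(charges) - b * mean(los)

# Equivalent slope written with the raw sums of deviations (the 1/(n-1) factors cancel)
b_sums <- sum((los - mean(los)) * (charges - mean(charges))) /
          sum((los - mean(los))^2)

# Reference fit using R's built-in linear model function
fit_sim <- lm(charges ~ los)

c(intercept = a, slope = b, slope_from_sums = b_sums)
coef(fit_sim)
```

All three routes give the same slope up to floating-point rounding, because the covariance/variance ratio and the ratio of deviation sums differ only by a common factor that cancels.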
10.2.1  Model  Assumptions

Regression  modeling  has  the  following  five  key  assumptions: 

•  Linear  relationship, 

•  Multivariate  normality, 

•  No  or  little  multicollinearity, 

•  No  auto-correlation,  independence, 

•  Homoscedasticity. 


10.2.2  Correlations 


The SOCR Interactive Scatterplot Game (requires a Java-enabled browser) provides a
dynamic interface demonstrating linear models, trends, correlations, slopes, and
residuals.

Using the covariance, we can calculate the correlation, which indicates how
closely the relationship between two variables follows a straight line:

$$\rho_{x,y} = Corr(x, y) = \frac{Cov(x, y)}{\sigma_x \sigma_y} = \frac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}.$$


In R, correlation is given by cor(), while the square root of the variance, or standard
deviation, is given by sd().

r<-cov(heart_attack$LOS, heart_attack$CHARGES)/(sd(heart_attack$LOS)*
    sd(heart_attack$CHARGES))
r

## [1] 0.2449743

cor(heart_attack$LOS, heart_attack$CHARGES)

## [1] 0.2449743

The same outputs are obtained by the manual and the automated correlation
calculations. This correlation is a positive number that is relatively small. We can
say there is a weak positive linear association between these two variables. If we
have a negative correlation estimate, it suggests a negative linear association. In
terms of the absolute value of the correlation, we have a weak association when
0.1 < |Cor| < 0.3, a moderate association for 0.3 < |Cor| < 0.5, and a strong
association for 0.5 < |Cor| < 1.0. If the absolute correlation is below 0.1, then it
suggests little to no linear relation between the variables.
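The thresholds above can be encoded in a small helper. This is a sketch only: the function name association_strength() is hypothetical, and the assignment of the exact boundary values (0.1, 0.3, 0.5) to the upper category is our assumption, since the text leaves the boundaries open.

```r
# Hypothetical helper that labels correlations using the cut-offs quoted in the text.
# Boundary values are assigned to the stronger category (an assumption).
association_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  strength <- cut(abs(r),
                  breaks = c(0, 0.1, 0.3, 0.5, 1),
                  labels = c("little or none", "weak", "moderate", "strong"),
                  include.lowest = TRUE, right = FALSE)
  direction <- ifelse(r > 0, "positive", ifelse(r < 0, "negative", "none"))
  data.frame(r = r, strength = strength, direction = direction)
}

# The LOS vs. CHARGES correlation reported above, plus a few made-up values
association_strength(c(0.2449743, -0.45, 0.05, 0.8))
```

Working with abs(r) makes the strength label independent of the sign, while the direction column keeps track of whether the association is positive or negative.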
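10.2.3 Multiple Linear Regression

In practice, most interesting problems involve multiple predictors and one dependent
variable, which requires estimating a multiple linear model. That is:

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon,$$

or equivalently

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon.$$

We usually use the second notation in statistics. This equation shows the
linear relationship between k predictors and a dependent variable. In total, we have
k + 1 coefficients to estimate.

The matrix notation corresponding to the above equation is:

$$Y = X\beta + \epsilon,$$

where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$

and

$$\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

is the error term.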
Similar to simple linear regression, our goal is to minimize the sum of squared errors.
Solving for $\beta$, we get:

$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$

This is the matrix form solution, where $(X^T X)^{-1}$ is the inverse of the matrix
$X^T X$ and $X^T$ is the transpose of X.

Let's write a simple R function reg(y, x) that implements this matrix formula.

reg<-function(y, x){
    x<-as.matrix(x)
    x<-cbind(Intercept=1, x)      # prepend a column of 1s for the intercept
    solve(t(x)%*%x)%*%t(x)%*%y    # (X^T X)^{-1} X^T y
}

The method solve() is used to compute the matrix inverse and %*% is matrix
multiplication.
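As a quick sanity check of the matrix formula, the sketch below compares three ways of computing the coefficients on simulated data with two predictors: the explicit inverse, the two-argument form solve(A, b), which solves the normal equations without forming the inverse, and lm(). The data and all names (x1, x2, beta_inverse, beta_solve, beta_lm) are invented for illustration.

```r
set.seed(2024)
n  <- 150

# Simulated predictors and response (arbitrary, for illustration only)
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- runif(n, min = 20, max = 80)
y  <- 1000 + 250 * x1 - 40 * x2 + rnorm(n, sd = 100)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# (1) Explicit inverse, exactly as in the matrix formula
beta_inverse <- solve(t(X) %*% X) %*% t(X) %*% y

# (2) Solve the normal equations (X^T X) beta = X^T y directly;
#     crossprod(X) is t(X) %*% X and crossprod(X, y) is t(X) %*% y
beta_solve <- solve(crossprod(X), crossprod(X, y))

# (3) Built-in linear model
beta_lm <- coef(lm(y ~ x1 + x2))

cbind(inverse = drop(beta_inverse), solve = drop(beta_solve), lm = beta_lm)
```

The three columns agree up to rounding error. Solving the linear system directly is generally preferred numerically over forming the inverse, and lm() itself relies on a QR decomposition rather than on the normal equations.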

Next, we will apply this function to our heart attack dataset. To begin, let's check
if the simple linear regression output is the same as we calculated earlier.

reg(y=heart_attack$CHARGES, x=heart_attack$LOS)

##                [,1]
## Intercept 4582.6997
##            212.2869

As the slope and intercept are consistent with our previous estimates, we can
continue and include additional variables as predictors. For instance, we can just add
age into the model.


str(heart_attack)

## 'data.frame':    148 obs. of  8 variables:
##  $ Patient  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DIAGNOSIS: int  41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
##  $ SEX      : chr  "F" "F" "F" "F" ...
##  $ DRG      : int  122 122 122 122 122 121 121 121 121 123 ...
##  $ DIED     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ CHARGES  : num  4752 3941 3657 1481 1681 ...
##  $ LOS      : int  10 6 5 2 1 9 15 15 2 1 ...
##  $ AGE      : int  79 34 76 80 55 84 84 70 76 65 ...

reg(y=heart_attack$CHARGES, x=heart_attack[, c(7, 8)])

##                 [,1]
## Intercept 7280.55493
## LOS        259.67361
## AGE        -43.67677
10.3 Case Study 1: Baseball Players
10.3.1 Step 1: Collecting Data

We utilize the MLB data "01a_data.txt". The dataset contains 1034 records of
heights and weights for some current and recent Major League Baseball (MLB)
players. These data were obtained from different resources (e.g., IBM Many Eyes).
This dataset includes the following variables:

•  Name:  MLB  Player  Name, 

•  Team:  The  Baseball  team  the  player  was  a  member  of  at  the  time  the  data  was 
acquired, 

•  Position:  Player  field  position, 

•  Height:  Player  height  in  inches,

•  Weight:  Player  weight  in  pounds,  and 

•  Age:  Player  age  at  time  of  record. 


10.3.2  Step  2:  Exploring  and  Preparing  the  Data 

Let's load this dataset first. We use as.is=T to make non-numerical vectors into
characters. Also, we delete the Name variable because we don't need players' names
in this case study.

mlb<-read.table('https://umich.instructure.com/files/330381/download?download_frd=1',
    as.is=T, header=T)
str(mlb)

## 'data.frame':    1034 obs. of  6 variables:
##  $ Name    : chr  "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
##  $ Team    : chr  "BAL" "BAL" "BAL" "BAL" ...
##  $ Position: chr  "Catcher" "Catcher" "Catcher" "First_Baseman" ...
##  $ Height  : int  74 74 72 72 73 69 69 71 76 71 ...
##  $ Weight  : int  180 215 210 210 188 176 209 200 231 180 ...
##  $ Age     : num  23 34.7 30.8 35.4 35.7 ...

mlb<-mlb[, -1]

By looking at the str() output, we notice that the variables Team and Position
are misspecified as characters. To fix this, we can use the function
as.factor() to convert numerical or character vectors to factors.

mlb$Team<-as.factor(mlb$Team)
mlb$Position<-as.factor(mlb$Position)

The data is good to go. Let's explore it using some summary statistics and plots
(Fig. 10.3).

summary(mlb$Weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   150.0   187.0   200.0   201.7   215.0   290.0

hist(mlb$Weight, main = "Histogram for Weights")

The above plot illustrates our dependent variable Weight. As we learned in
Chap. 3, this distribution appears somewhat right-skewed (Fig. 10.3).

Applying ggpairs() to obtain a compact dataset summary, we can mark heavy-weight
and light-weight players (according to light < median < heavy) by different
colors in the plot in Fig. 10.4.

require(GGally)
mlb_binary = mlb
mlb_binary$bi_weight =
    as.factor(ifelse(mlb_binary$Weight>median(mlb_binary$Weight), 1, 0))
g_weight <- ggpairs(data=mlb_binary[-1], title="MLB Light/Heavy Weights",
    mapping=ggplot2::aes(colour = bi_weight),
    lower=list(combo=wrap("facethist", binwidth=1)))
g_weight

Next, we may also mark player positions by different colors in the plot
(Fig. 10.5).

g_position <- ggpairs(data=mlb[-1], title="MLB by Position",
    mapping=ggplot2::aes(colour = Position),
    lower=list(combo=wrap("facethist", binwidth=1)))
g_position


Fig. 10.3 Frequency histogram of the MLB players' weights

What about potential predictors?

Fig. 10.4 Pair plots of the MLB data by player's light (red) or heavy (blue) weights


table(mlb$Team)

## ANA ARZ ATL BAL BOS CHC CIN CLE COL CWS DET FLA HOU  KC  LA MIN MLW NYM
##  35  28  37  35  36  36  36  35  35  33  37  32  34  35  33  33  35  38
## NYY OAK PHI PIT  SD SEA  SF STL  TB TEX TOR WAS
##  32  37  36  35  33  34  34  32  33  35  34  36

table(mlb$Position)

##
##           Catcher Designated_Hitter     First_Baseman        Outfielder
##                76                18                55               194
##    Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher
##               315                58                52               221
##     Third_Baseman
##                45

summary(mlb$Height)
Fig. 10.5 Pair plots of the MLB data by position type


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    67.0    72.0    74.0    73.7    75.0    83.0

summary(mlb$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   20.90   25.44   27.92   28.74   31.23   48.52

In  this  case,  we  have  two  numerical  predictors,  two  categorical  predictors  and 
1,034  observations.  Let’s  see  how  R  treats  different  classes  of  variables. 
10.3.3 Exploring Relationships Among Features:
The Correlation Matrix

Before fitting a model, let's examine the independence of our potential predictors
and the dependent variable. Multiple linear regression assumes that the predictors are
all independent of each other. Is this assumption valid? As we mentioned earlier, the
cor() function can answer this question in a pairwise manner. Note that we only
look at numerical variables.

cor(mlb[c("Weight", "Height", "Age")])

##           Weight      Height         Age
## Weight 1.0000000  0.53031802  0.15784706
## Height 0.5303180  1.00000000 -0.07367013
## Age    0.1578471 -0.07367013  1.00000000

Observe that cor(y, x) = cor(x, y) and cor(x, x) = 1. Also, our Height variable is
weakly and negatively related to the players' age. This is reassuring, as such a weak
correlation would not cause a multicollinearity problem. If two of our predictors are
highly correlated, they both provide almost the same information, which could imply
multicollinearity. A common practice is to delete one of them from the model or to use
dimensionality reduction methods.
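One simple way to screen for such redundant predictors is to scan the correlation matrix for pairs whose absolute correlation exceeds a cutoff. The sketch below uses simulated data in which one variable is (almost) a unit conversion of another; the variable names, the simulated values, and the 0.8 cutoff are all assumptions made only for illustration.

```r
set.seed(99)
n <- 300

# Simulated predictors; height_cm is a nearly redundant copy of height (in inches)
height    <- rnorm(n, mean = 74, sd = 2.3)
age       <- rnorm(n, mean = 29, sd = 4)
height_cm <- height * 2.54 + rnorm(n, sd = 0.5)
predictors <- data.frame(height, age, height_cm)

r <- cor(predictors)
cutoff <- 0.8                          # illustrative threshold, not a universal rule

# Keep each pair once by blanking the diagonal and the upper triangle
r[upper.tri(r, diag = TRUE)] <- NA
high <- which(abs(r) > cutoff, arr.ind = TRUE)

flagged <- data.frame(var1 = rownames(r)[high[, "row"]],
                      var2 = colnames(r)[high[, "col"]],
                      cor  = r[high])
flagged
```

Here only the height/height_cm pair is flagged, and one of the two would be a natural candidate to drop. Pairwise screening of this kind cannot detect collinearity that involves three or more predictors jointly, so it is best viewed as a first check.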


10.3.4  Visualizing  Relationships  Among  Features: 

The  Scatterplot  Matrix 

To visualize pairwise correlations, we could use a scatterplot or pairs() plot
(Fig. 10.6).

pairs(mlb[c("Weight", "Height", "Age")])

You might get a sense of the data, but it is difficult to see any linear pattern. We
can make a more sophisticated graph using pairs.panels() in the psych
package (Fig. 10.7).

# install.packages("psych")
library(psych)
pairs.panels(mlb[, c("Weight", "Height", "Age")])

This plot provides much more information about the three variables. Above the
diagonal, we have our correlation coefficients in numerical form. On the diagonal, there
are histograms of the variables. Below the diagonal, visual information is presented to
help us understand the trend. This specific graph shows that height and weight are
positively and strongly correlated. Also, the relationships between age and height, as
well as age and weight, are very weak; see the nearly horizontal red lines in the panels
below the main diagonal, which indicate weak relationships (Fig. 10.7).
Fig. 10.6 MLB players' weights, heights, and ages

Fig. 10.7 A more detailed pairs plot of MLB players' weights, heights, and ages

10.3.5 Step 3: Training a Model on the Data

The function we are going to use now is lm(). No additional package is needed
when using this function.

The lm() function has the following components:

m<-lm(dv ~ iv, data=mydata)

•  dv: dependent variable

•  iv: independent variables. Just like OneR() in Chap. 9, if we use . as iv, then all
of the variables, except the dependent variable (dv), are included as predictors.

•  data: specifies the data containing both the dependent variable and the independent
variables.

fit<-lm(Weight~., data=mlb)
fit

##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Coefficients:
##               (Intercept)                    TeamARZ
##                 -164.9995                     7.1881
##                   TeamATL                    TeamBAL
##                   -1.5631                    -5.3128
##                   TeamBOS                    TeamCHC
##                   -0.2838                     0.4026
##                   TeamCIN                    TeamCLE
##                    2.1051                    -1.3160
##                   TeamCOL                    TeamCWS
##                   -3.7836                     4.2944
##                   TeamDET                    TeamFLA
##                    2.3024                     2.6985
##                   TeamHOU                     TeamKC
##                   -0.6808                    -4.7664
##                    TeamLA                    TeamMIN
##                    2.8598                     2.1269
##                   TeamMLW                    TeamNYM
##                    4.2897                    -1.9736
##                   TeamNYY                    TeamOAK
##                    1.7483                    -0.5464
##                   TeamPHI                    TeamPIT
##                   -6.8486                     4.3023
##                    TeamSD                    TeamSEA
##                    2.6133                    -0.9147
##                    TeamSF                    TeamSTL
##                    0.8411                    -1.1341
##                    TeamTB                    TeamTEX
##                   -2.6616                    -0.7695
##                   TeamTOR                    TeamWAS
##                    1.3943                    -1.7555
## PositionDesignated_Hitter      PositionFirst_Baseman
##                    8.9037                     2.4237
##        PositionOutfielder     PositionRelief_Pitcher
##                   -6.2636                    -7.7695
##    PositionSecond_Baseman          PositionShortstop
##                  -13.0843                   -16.9562
##  PositionStarting_Pitcher      PositionThird_Baseman
##                   -7.3599                    -4.6035
##                    Height                        Age
##                    4.7175                     0.8906

As we can see from the output, factors are included in the model by creating
several indicators, one for each factor level except the reference level (here, there
are no TeamANA or PositionCatcher terms). For each numerical variable, a
corresponding model coefficient is estimated.
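The indicator (dummy) coding that lm() applies to factors can be inspected directly with model.matrix(). The toy data frame below is invented for illustration and is not part of the MLB dataset.

```r
# Invented toy data: one factor predictor and one numeric predictor
toy <- data.frame(
  Weight   = c(180, 215, 210, 188, 176, 231),
  Position = factor(c("Catcher", "Catcher", "Shortstop",
                      "Outfielder", "Shortstop", "Outfielder")),
  Height   = c(74, 74, 72, 73, 69, 76)
)

# Factor levels are sorted alphabetically; the first one is the reference level
levels(toy$Position)

# Design matrix that lm(Weight ~ ., data = toy) would use internally
X <- model.matrix(Weight ~ ., data = toy)
colnames(X)
X
```

The factor with three levels contributes only two indicator columns. The reference level (Catcher in this toy example) is absorbed into the intercept, so each indicator coefficient is interpreted as a difference relative to that reference level.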


10.3.6  Step  4:  Evaluating  Model  Performance 

As  we  did  in  previous  case-studies,  let’s  examine  the  model  performance  (Figs.  10.8 
and  10.9). 


Fig. 10.8 Scatterplot of the residuals vs. model fitted values

Fig. 10.9 QQ-normal plot of the residuals suggesting a linear model may explain the players'
weight
summary(fit)

##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.692 -10.909  -0.778   9.858  73.649
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -164.9995    19.3828  -8.513  < 2e-16 ***
## TeamARZ                      7.1881     4.2590   1.688 0.091777 .
## TeamATL                     -1.5631     3.9757  -0.393 0.694278
## TeamBAL                     -5.3128     4.0193  -1.322 0.186533
## TeamBOS                     -0.2838     4.0034  -0.071 0.943492
## TeamCHC                      0.4026     3.9949   0.101 0.919749
## TeamCIN                      2.1051     3.9934   0.527 0.598211
## TeamCLE                     -1.3160     4.0356  -0.326 0.744423
## TeamCOL                     -3.7836     4.0287  -0.939 0.347881
## TeamCWS                      4.2944     4.1022   1.047 0.295413
## TeamDET                      2.3024     3.9725   0.580 0.562326
## TeamFLA                      2.6985     4.1336   0.653 0.514028
## TeamHOU                     -0.6808     4.0634  -0.168 0.866976
## TeamKC                      -4.7664     4.0242  -1.184 0.236525
## TeamLA                       2.8598     4.0817   0.701 0.483686
## TeamMIN                      2.1269     4.0947   0.519 0.603579
## TeamMLW                      4.2897     4.0243   1.066 0.286706
## TeamNYM                     -1.9736     3.9493  -0.500 0.617370
## TeamNYY                      1.7483     4.1234   0.424 0.671655
## TeamOAK                     -0.5464     3.9672  -0.138 0.890474
## TeamPHI                     -6.8486     3.9949  -1.714 0.086778 .
## TeamPIT                      4.3023     4.0210   1.070 0.284890
## TeamSD                       2.6133     4.0915   0.639 0.523148
## TeamSEA                     -0.9147     4.0516  -0.226 0.821436
## TeamSF                       0.8411     4.0520   0.208 0.835593
## TeamSTL                     -1.1341     4.1193  -0.275 0.783132
## TeamTB                      -2.6616     4.0944  -0.650 0.515798
## TeamTEX                     -0.7695     4.0283  -0.191 0.848556
## TeamTOR                      1.3943     4.0681   0.343 0.731871
## TeamWAS                     -1.7555     4.0038  -0.438 0.661142
## PositionDesignated_Hitter    8.9037     4.4533   1.999 0.045842 *
## PositionFirst_Baseman        2.4237     3.0058   0.806 0.420236
## PositionOutfielder          -6.2636     2.2784  -2.749 0.006084 **
## PositionRelief_Pitcher      -7.7695     2.1959  -3.538 0.000421 ***
## PositionSecond_Baseman     -13.0843     2.9638  -4.415 1.12e-05 ***
## PositionShortstop          -16.9562     3.0406  -5.577 3.16e-08 ***
## PositionStarting_Pitcher    -7.3599     2.2976  -3.203 0.001402 **
## PositionThird_Baseman       -4.6035     3.1689  -1.453 0.146613
## Height                       4.7175     0.2563  18.405  < 2e-16 ***
## Age                          0.8906     0.1259   7.075 2.82e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.78 on 994 degrees of freedom
## Multiple R-squared:  0.3858, Adjusted R-squared:  0.3617
## F-statistic: 16.01 on 39 and 994 DF,  p-value: < 2.2e-16

plot(fit, which = 1:2)
The model summary shows us how well the model fits the data.

Residuals This tells us about the residuals. If we have extremely large or extremely
small residuals for some observations compared to the rest of the residuals, either they
are outliers due to reporting errors or the model fits the data poorly. We have 73.649 as
our maximum and -48.692 as our minimum. The residuals can be characterized
by examining their range and by viewing the residual diagnostic plots.

Coefficients  In  this  section  of  the  output,  we  look  at  the  very  right  column  that  has 
symbols  like  stars  or  dots  showing  if  that  variable  is  significant  and  should  be 
included  in  the  model.  However,  if  no  symbol  is  included  next  to  a  variable,  then 
it  means  this  estimated  covariate  coefficient  in  the  linear  model  covariance  could  be 
trivial.  Another  thing  we  can  look  at  is  the  Pr  ( >  1 1 1  )  column.  A  number  close  to 
zero  in  this  column  indicates  the  row  variable  is  significant,  otherwise  it  could  be 
removed  from  the  model. 

In  this  example,  some  of  the  teams  and  positions  are  significant  and  some  are  not. 
Both  Age  and  Height  are  significant. 

R- squared  What  percent  in  y  is  explained  by  the  included  predictors?  Here,  we 
have  38.58%,  which  indicates  the  model  is  not  bad  but  could  be  improved.  Usually  a 
well-fitted  linear  regression  would  have  R-squared  over  70%. 

The  diagnostic  plots  also  help  us  understand  the  model  quality. 

Residual  vs\  Fitted  This  is  the  main  residual  diagnostic  plot,  Fig.  10.8.  We  can  see 
that  the  residuals  of  observations  indexed  65,  160  and  237  are  relatively  far  apart 
from  the  rest.  They  may  represent  potential  influential  points  or  outliers. 

Normal  Q-Q  This  plot  examines  the  normality  assumption  of  the  model,  Fig.  10.9. 
The  scattered  dots  represent  the  matched  quantiles  of  the  data  and  the  normal 
distribution.  If  the  Q-Q  plot  closely  resembles  a  line  bisecting  the  first  quadrant  in 
the  plane,  the  normality  assumption  is  valid.  In  our  case,  it  is  relatively  close  to  the 
line.  So,  we  can  say  that  our  model  is  valid  in  terms  of  normality. 
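The quantities discussed above can also be pulled out of a fitted model directly. The following is a minimal sketch on simulated data; the data frame `sim`, its coefficients, and the object names are assumptions made for illustration and are not the MLB data or the fit reported above.

```r
# Simulated stand-in for the MLB data (hypothetical values, not the real dataset)
set.seed(1)
n <- 200
sim <- data.frame(Height = rnorm(n, mean = 74, sd = 2.3),
                  Age    = runif(n, min = 21, max = 40))
sim$Weight <- -165 + 4.7 * sim$Height + 0.9 * sim$Age + rnorm(n, sd = 17)

fit_sim <- lm(Weight ~ Height + Age, data = sim)
s <- summary(fit_sim)

range(residuals(fit_sim))        # smallest and largest residual
s$r.squared                      # proportion of the variance in Weight explained
s$coefficients[, "Pr(>|t|)"]     # the p-value column of the coefficient table

# Indices of the three observations with the largest absolute standardized
# residuals, i.e., the kind of points the diagnostic plots label by index
top3 <- head(order(abs(rstandard(fit_sim)), decreasing = TRUE), 3)
top3
```

Inspecting the rows returned in `top3` is usually the first step in deciding whether a flagged observation is a data error or a genuine extreme value.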


10.4  Step  5:  Improving  Model  Performance 

We can employ the step() function to perform forward or backward selection of important features/predictors. It works for both lm and glm models. In most cases, backward selection is preferable because it starts from the full model and tends to retain larger models. There are various criteria to evaluate a model; common model evaluation metrics include AIC, BIC, and adjusted R-squared. In Chap. 14, we will present more details about prediction evaluation and assessment of classification. Let's compare the backward and forward model selection approaches. The step() argument direction controls the search: "backward" only drops terms, "forward" only adds terms, and "both" considers dropping and adding a term at every step. When no scope argument is supplied, the starting model is also the largest model considered and the default direction is "backward".
step(fit, direction = "backward")

## Start:  AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## - Team     29      9468 289262 5847.4
## <none>                  279793 5871.0
## - Age       1     14090 293883 5919.8
## - Position  8     20301 300095 5927.5
## - Height    1     95356 375149 6172.3
##
## Step:  AIC=5847.45
## Weight ~ Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## <none>                  289262 5847.4
## - Age       1     14616 303877 5896.4
## - Position  8     20406 309668 5901.9
## - Height    1    100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
##               (Intercept)  PositionDesignated_Hitter
##                 -168.0474                     8.6968
##     PositionFirst_Baseman         PositionOutfielder
##                    2.7780                    -6.0457
##    PositionRelief_Pitcher     PositionSecond_Baseman
##                   -7.7782                   -13.0267
##         PositionShortstop   PositionStarting_Pitcher
##                  -16.4821                    -7.3961
##     PositionThird_Baseman                     Height
##                   -4.1361                     4.7639
##                       Age
##                    0.8771

step(fit, direction = "forward")

## Start:  AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Call:
## lm(formula = Weight ~ Team + Position + Height + Age, data = mlb)
##
## Coefficients:
##               (Intercept)                    TeamARZ
##                 -164.9995                     7.1881
##                   TeamATL                    TeamBAL
##                   -1.5631                    -5.3128
##                   TeamBOS                    TeamCHC
##                   -0.2838                     0.4026
##                   TeamCIN                    TeamCLE
##                    2.1051                    -1.3160
##                   TeamCOL                    TeamCWS
##                   -3.7836                     4.2944
##                   TeamDET                    TeamFLA
##                    2.3024                     2.6985
##                   TeamHOU                     TeamKC
##                   -0.6808                    -4.7664
##                    TeamLA                    TeamMIN
##                    2.8598                     2.1269
##                   TeamMLW                    TeamNYM
##                    4.2897                    -1.9736
##                   TeamNYY                    TeamOAK
##                    1.7483                    -0.5464
##                   TeamPHI                    TeamPIT
##                   -6.8486                     4.3023
##                    TeamSD                    TeamSEA
##                    2.6133                    -0.9147
##                    TeamSF                    TeamSTL
##                    0.8411                    -1.1341
##                    TeamTB                    TeamTEX
##                   -2.6616                    -0.7695
##                   TeamTOR                    TeamWAS
##                    1.3943                    -1.7555
## PositionDesignated_Hitter      PositionFirst_Baseman
##                    8.9037                     2.4237
##        PositionOutfielder     PositionRelief_Pitcher
##                   -6.2636                    -7.7695
##    PositionSecond_Baseman          PositionShortstop
##                  -13.0843                   -16.9562
##  PositionStarting_Pitcher      PositionThird_Baseman
##                   -7.3599                    -4.6035
##                    Height                        Age
##                    4.7175                     0.8906

step(fit, direction = "both")


## Start:  AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## - Team     29      9468 289262 5847.4
## <none>                  279793 5871.0
## - Age       1     14090 293883 5919.8
## - Position  8     20301 300095 5927.5
## - Height    1     95356 375149 6172.3
##
## Step:  AIC=5847.45
## Weight ~ Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## <none>                  289262 5847.4
## + Team     29      9468 279793 5871.0
## - Age       1     14616 303877 5896.4
## - Position  8     20406 309668 5901.9
## - Height    1    100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
##               (Intercept)  PositionDesignated_Hitter
##                 -168.0474                     8.6968
##     PositionFirst_Baseman         PositionOutfielder
##                    2.7780                    -6.0457
##    PositionRelief_Pitcher     PositionSecond_Baseman
##                   -7.7782                   -13.0267
##         PositionShortstop   PositionStarting_Pitcher
##                  -16.4821                    -7.3961
##     PositionThird_Baseman                     Height
##                   -4.1361                     4.7639
##                       Age
##                    0.8771

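We can observe that forward retains the whole model: the search was started from the full model without a scope argument, so there were no further terms to add. Backward step-wise selection gives the better feature selection here, dropping Team and lowering the AIC from 5871.04 to 5847.45.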
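For a forward search to have room to add terms, step() is usually started from a small model together with a scope that names the largest model allowed. The sketch below illustrates this on simulated data; the data frame `sim` and the variables `x1`, `x2`, `x3` are hypothetical and only serve the illustration.

```r
set.seed(2)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)   # x3 is pure noise

null_fit <- lm(y ~ 1, data = sim)              # intercept-only starting point
full_fit <- lm(y ~ x1 + x2 + x3, data = sim)   # largest model to consider

# Forward search: start small and allow terms up to the full model
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "forward", trace = 0)

# Forward search started from the full model has nothing left to add
fwd_from_full <- step(full_fit, direction = "forward", trace = 0)

formula(fwd)
formula(fwd_from_full)
```

Setting `trace = 0` only suppresses the step-by-step printout; the returned object is the selected model in both calls.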

Both backward and forward selection are greedy algorithms and neither guarantees an optimal model. Optimal feature selection requires exploring every possible combination of the predictors, which is often not feasible in practice due to computational complexity: there are $\binom{n}{k}$ ways to choose k out of n predictors, and $2^n$ candidate subsets in total.
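To get a feeling for how quickly the number of candidate models grows, the counts can be computed directly. This is a small illustrative calculation; `p = 4` corresponds to the four model terms (Team, Position, Height, Age) used above, and the larger values are arbitrary examples.

```r
p <- 4                          # Team, Position, Height, Age
per_size <- choose(p, 0:p)      # number of subsets of each size k = 0, ..., p
per_size
total <- sum(per_size)          # total number of candidate subsets, equals 2^p
total

2^c(10, 20, 40)                 # growth of the search space with more predictors
```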

Alternatively,  we  can  choose  models  based  on  various  information  criteria. 


step(fit, k = 2)

## Start:  AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## - Team     29      9468 289262 5847.4
## <none>                  279793 5871.0
## - Age       1     14090 293883 5919.8
## - Position  8     20301 300095 5927.5
## - Height    1     95356 375149 6172.3
##
## Step:  AIC=5847.45
## Weight ~ Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## <none>                  289262 5847.4
## - Age       1     14616 303877 5896.4
## - Position  8     20406 309668 5901.9
## - Height    1    100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
##               (Intercept)  PositionDesignated_Hitter
##                 -168.0474                     8.6968
##     PositionFirst_Baseman         PositionOutfielder
##                    2.7780                    -6.0457
##    PositionRelief_Pitcher     PositionSecond_Baseman
##                   -7.7782                   -13.0267
##         PositionShortstop   PositionStarting_Pitcher
##                  -16.4821                    -7.3961
##     PositionThird_Baseman                     Height
##                   -4.1361                     4.7639
##                       Age
##                    0.8771

step(fit, k = log(nrow(mlb)))

## Start:  AIC=6068.69
## Weight ~ Team + Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## - Team     29      9468 289262 5901.8
## <none>                  279793 6068.7
## - Position  8     20301 300095 6085.6
## - Age       1     14090 293883 6112.5
## - Height    1     95356 375149 6365.0
##
## Step:  AIC=5901.8
## Weight ~ Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## <none>                  289262 5901.8
## - Position  8     20406 309668 5916.8
## - Age       1     14616 303877 5945.8
## - Height    1    100435 389697 6203.0
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
##               (Intercept)  PositionDesignated_Hitter
##                 -168.0474                     8.6968
##     PositionFirst_Baseman         PositionOutfielder
##                    2.7780                    -6.0457
##    PositionRelief_Pitcher     PositionSecond_Baseman
##                   -7.7782                   -13.0267
##         PositionShortstop   PositionStarting_Pitcher
##                  -16.4821                    -7.3961
##     PositionThird_Baseman                     Height
##                   -4.1361                     4.7639
##                       Age
##                    0.8771



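k = 2 yields the AIC criterion, and k = log(n), where n is the number of observations, refers to BIC. Note that step() labels the criterion column AIC in both cases. Let's try to evaluate the model performance again (Figs. 10.10 and 10.11).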
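The starting values printed by step() can be reproduced by hand from the numbers in the trace. For lm fits, step() relies on extractAIC(), which computes n log(RSS/n) + k edf, where edf is the number of estimated regression coefficients. The sketch below plugs in the values reported above; the helper name `crit` is only illustrative.

```r
n   <- 1034      # number of players (observations)
rss <- 279793    # residual sum of squares of the full model, from the step() trace
edf <- 40        # 39 slope coefficients plus the intercept (F-statistic: 39 and 994 DF)

# Penalized criterion used by step() for linear models
crit <- function(k) n * log(rss / n) + k * edf

aic_full <- crit(2)         # k = 2: AIC, close to the 5871.04 reported above
bic_full <- crit(log(n))    # k = log(n): BIC, close to the 6068.69 reported above
round(c(AIC = aic_full, BIC = bic_full), 2)
```

Because the RSS in the trace is rounded, the hand-computed values can differ from the printed ones in the second decimal place. The heavier BIC penalty, log(1034) per coefficient instead of 2, is what makes the 29 Team coefficients look even less worthwhile under BIC.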
Fig. 10.10 Residuals vs. fitted values scatterplot for lm(Weight ~ Position + Height + Age); observations 65, 160 and 237 are labeled

Fig. 10.11 QQ normal probability plot of the model residuals for lm(Weight ~ Position + Height + Age)
fit2 = step(fit, k = 2, direction = "backward")

## Start:  AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## - Team     29      9468 289262 5847.4
## <none>                  279793 5871.0
## - Age       1     14090 293883 5919.8
## - Position  8     20301 300095 5927.5
## - Height    1     95356 375149 6172.3
##
## Step:  AIC=5847.45
## Weight ~ Position + Height + Age
##
##            Df Sum of Sq    RSS    AIC
## <none>                  289262 5847.4
## - Age       1     14616 303877 5896.4
## - Position  8     20406 309668 5901.9
## - Height    1    100435 389697 6153.6

summary(fit2)

##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.427 -10.855  -0.344  10.110  75.301
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -168.0474    19.0351  -8.828  < 2e-16 ***
## PositionDesignated_Hitter    8.6968     4.4258   1.965 0.049679 *
## PositionFirst_Baseman        2.7780     2.9942   0.928 0.353741
## PositionOutfielder          -6.0457     2.2778  -2.654 0.008072 **
## PositionRelief_Pitcher      -7.7782     2.1913  -3.550 0.000403 ***
## PositionSecond_Baseman     -13.0267     2.9531  -4.411 1.14e-05 ***
## PositionShortstop          -16.4821     3.0372  -5.427 7.16e-08 ***
## PositionStarting_Pitcher    -7.3961     2.2959  -3.221 0.001316 **
## PositionThird_Baseman       -4.1361     3.1656  -1.307 0.191647
## Height                       4.7639     0.2528  18.847  < 2e-16 ***
## Age                          0.8771     0.1220   7.190 1.25e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.82 on 1023 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3588
## F-statistic: 58.81 on 10 and 1023 DF,  p-value: < 2.2e-16

plot(fit2, which = 1:2)


Sometimes, we prefer a simpler model even if there is a slight loss in performance. In this case, we have a simpler model with R-squared = 0.365, compared to 0.3858 for the full model. The whole model is still very significant. Some potential influential points or outliers, which are relatively far from the other observations, are shown in Fig. 10.12.
Fig. 10.12 A half-normal probability plot suggesting important factors or interactions by estimating the impact of a given main effect, or interaction, and its rank relative to other main effects and interactions computed via least squares estimation. The horizontal and vertical axes represent the (n-1) theoretical order statistic medians from a half-normal distribution and the ordered absolute value of the estimated effects for the main factors and available interactions, respectively


# Half-normal plot for leverages
# install.packages("faraway")
library(faraway)
halfnorm(lm.influence(fit)$hat, nlab = 2, ylab = "Leverages")

mlb[c(226, 879), ]

##     Team          Position Height Weight   Age
## 226  NYY Designated_Hitter     75    230 36.14
## 879   SD Designated_Hitter     73    200 25.60

summary(mlb)

##       Team                 Position       Height         Weight
##  NYM    : 38   Relief_Pitcher  :315   Min.   :67.0   Min.   :150.0
##  ATL    : 37   Starting_Pitcher:221   1st Qu.:72.0   1st Qu.:187.0
##  DET    : 37   Outfielder      :194   Median :74.0   Median :200.0
##  OAK    : 37   Catcher         : 76   Mean   :73.7   Mean   :201.7
##  BOS    : 36   Second_Baseman  : 58   3rd Qu.:75.0   3rd Qu.:215.0
##  CHC    : 36   First_Baseman   : 55   Max.   :83.0   Max.   :290.0
##  (Other):813   (Other)         :115
##       Age
##  Min.   :20.90
##  1st Qu.:25.44
##  Median :27.93
##  Mean   :28.74
##  3rd Qu.:31.23
##  Max.   :48.52


A  deeper  discussion  of  variable  selection,  controlling  the  false  discovery  rate,  is 
provided  in  Chaps.  17  and  18. 
10.4.1  Model  Specification:  Adding  Non-linear  Relationships 


In  linear  regression,  the  relationship  between  independent  and  dependent  variables  is 
assumed  to  be  linear.  However,  this  might  not  be  the  case.  The  relationship  between 
age  and  weight  could  be  quadratic,  since  middle-aged  people  might  gain  weight 
dramatically. 


mlb$age2 <- (mlb$Age)^2
fit2 <- lm(Weight ~ ., data = mlb)
summary(fit2)

##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.068 -10.775  -1.021   9.922  74.693
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -209.07068   27.49529  -7.604 6.65e-14 ***
## TeamARZ                      7.41943    4.25154   1.745 0.081274 .
## TeamATL                     -1.43167    3.96793  -0.361 0.718318
## TeamBAL                     -5.38735    4.01119  -1.343 0.179552
## TeamBOS                     -0.06614    3.99633  -0.017 0.986799
## TeamCHC                      0.14541    3.98833   0.036 0.970923
## TeamCIN                      2.24022    3.98571   0.562 0.574201
## TeamCLE                     -1.07546    4.02870  -0.267 0.789563
## TeamCOL                     -3.87254    4.02069  -0.963 0.335705
## TeamCWS                      4.20933    4.09393   1.028 0.304111
## TeamDET                      2.66990    3.96769   0.673 0.501160
## TeamFLA                      3.14627    4.12989   0.762 0.446343
## TeamHOU                     -0.77230    4.05526  -0.190 0.849000
## TeamKC                      -4.90984    4.01648  -1.222 0.221837
## TeamLA                       3.13554    4.07514   0.769 0.441820
## TeamMIN                      2.09951    4.08631   0.514 0.607512
## TeamMLW                      4.16183    4.01646   1.036 0.300363
## TeamNYM                     -1.25057    3.95424  -0.316 0.751870
## TeamNYY                      1.67825    4.11502   0.408 0.683482
## TeamOAK                     -0.68235    3.95951  -0.172 0.863212
## TeamPHI                     -6.85071    3.98672  -1.718 0.086039 .
## TeamPIT                      4.12683    4.01348   1.028 0.304086
## TeamSD                       2.59525    4.08310   0.636 0.525179
## TeamSEA                     -0.67316    4.04471  -0.166 0.867853
## TeamSF                       1.06038    4.04481   0.262 0.793255
## TeamSTL                     -1.38669    4.11234  -0.337 0.736037
## TeamTB                      -2.44396    4.08716  -0.598 0.550003
## TeamTEX                     -0.68740    4.02023  -0.171 0.864270
## TeamTOR                      1.24439    4.06029   0.306 0.759306
## TeamWAS                     -1.87599    3.99594  -0.469 0.638835
## PositionDesignated_Hitter    8.94440    4.44417   2.013 0.044425 *
## PositionFirst_Baseman        2.55100    3.00014   0.850 0.395368
## PositionOutfielder          -6.25702    2.27372  -2.752 0.006033 **
## PositionRelief_Pitcher      -7.68904    2.19166  -3.508 0.000471 ***
## PositionSecond_Baseman     -13.01400    2.95787  -4.400 1.20e-05 ***
## PositionShortstop          -16.82243    3.03494  -5.543 3.81e-08 ***
## PositionStarting_Pitcher    -7.08215    2.29615  -3.084 0.002096 **
## PositionThird_Baseman       -4.66452    3.16249  -1.475 0.140542
## Height                       4.71888    0.25578  18.449  < 2e-16 ***
## Age                          3.82295    1.30621   2.927 0.003503 **
## age2                        -0.04791    0.02124  -2.255 0.024327 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.74 on 993 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3643
## F-statistic:  15.8 on 40 and 993 DF,  p-value: < 2.2e-16

This brought the overall R-squared up to 0.3889, compared to 0.3858 for the model without the quadratic term.
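A quadratic term can also be written inside the model formula with I(), which avoids adding a helper column to the data frame (a column that a formula such as Weight ~ . would then pick up automatically). The following is a minimal sketch on simulated data; the data frame `sim` and its coefficients are hypothetical, not the MLB data.

```r
set.seed(3)
n <- 500
sim <- data.frame(Age = runif(n, 21, 45))
# Hypothetical concave age effect plus noise
sim$Weight <- 120 + 5 * sim$Age - 0.06 * sim$Age^2 + rnorm(n, sd = 10)

fit_lin  <- lm(Weight ~ Age, data = sim)
fit_quad <- lm(Weight ~ Age + I(Age^2), data = sim)  # I() protects ^ inside a formula

summary(fit_lin)$r.squared
summary(fit_quad)$r.squared
anova(fit_lin, fit_quad)      # F-test for the added quadratic term

# The same fit obtained with an explicit squared column, as in the text
sim2 <- transform(sim, age2 = Age^2)
fit_col <- lm(Weight ~ Age + age2, data = sim2)
```

Since the linear model is nested in the quadratic one, R-squared can only stay equal or increase; the F-test from anova() indicates whether the increase is larger than expected by chance.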


10.4.2  Transformation:  Converting  a  Numeric  Variable 
to  a  Binary  Indicator 


As discussed earlier, middle-aged people might have a different pattern of weight increase compared to younger people. The overall pattern might not follow a single curve, but could rather be represented by separate lines for the young and the middle-aged people. We assume 30 is the age threshold separating young from middle-aged players. Players over 30 may have a steeper line for weight increase than those under 30. Here, we use the ifelse() function that we mentioned in Chap. 8 to create the indicator of this Age threshold. Note that adding the indicator as a main effect shifts the fitted line for players aged 30 and over by a constant; allowing a different slope would additionally require an interaction between Age and the indicator.


mlb$age30 <- ifelse(mlb$Age >= 30, 1, 0)
fit3 <- lm(Weight ~ Team + Position + Age + age30 + Height, data = mlb)
summary(fit3)

##
## Call:
## lm(formula = Weight ~ Team + Position + Age + age30 + Height,
##     data = mlb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.313 -11.166  -0.916  10.044  73.630
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -159.8884    19.8862  -8.040 2.54e-15 ***
## TeamARZ                      7.4096     4.2627   1.738 0.082483 .
## TeamATL                     -1.4379     3.9765  -0.362 0.717727
## TeamBAL                     -5.3393     4.0187  -1.329 0.184284
## TeamBOS                     -0.1985     4.0034  -0.050 0.960470
## TeamCHC                      0.4669     3.9947   0.117 0.906976
## TeamCIN                      2.2124     3.9939   0.554 0.579741
## TeamCLE                     -1.1624     4.0371  -0.288 0.773464
## TeamCOL                     -3.6842     4.0290  -0.914 0.360717
## TeamCWS                      4.1920     4.1025   1.022 0.307113
## TeamDET                      2.4708     3.9746   0.622 0.534314
## TeamFLA                      2.8563     4.1352   0.691 0.489903
## TeamHOU                     -0.4964     4.0659  -0.122 0.902846
## TeamKC                      -4.7138     4.0238  -1.171 0.241692
## TeamLA                       2.9194     4.0814   0.715 0.474586
## TeamMIN                      2.2885     4.0965   0.559 0.576528
## TeamMLW                      4.4749     4.0269   1.111 0.266731
## TeamNYM                     -1.8173     3.9510  -0.460 0.645659
## TeamNYY                      1.7074     4.1229   0.414 0.678867
## TeamOAK                     -0.3388     3.9707  -0.085 0.932012
## TeamPHI                     -6.6192     3.9993  -1.655 0.098220 .
## TeamPIT                      4.6716     4.0332   1.158 0.247029
## TeamSD                       2.8600     4.0965   0.698 0.485243
## TeamSEA                     -1.0121     4.0518  -0.250 0.802809
## TeamSF                       1.0244     4.0545   0.253 0.800587
## TeamSTL                     -1.1094     4.1187  -0.269 0.787703
## TeamTB                      -2.4485     4.0980  -0.597 0.550312
## TeamTEX                     -0.6112     4.0300  -0.152 0.879485
## TeamTOR                      1.3959     4.0674   0.343 0.731532
## TeamWAS                     -1.4189     4.0139  -0.354 0.723784
## PositionDesignated_Hitter    9.2378     4.4621   2.070 0.038683 *
## PositionFirst_Baseman        2.6074     3.0096   0.866 0.386501
## PositionOutfielder          -6.0408     2.2863  -2.642 0.008367 **
## PositionRelief_Pitcher      -7.5100     2.2072  -3.403 0.000694 ***
## PositionSecond_Baseman     -12.8870     2.9683  -4.342 1.56e-05 ***
## PositionShortstop          -16.8912     3.0406  -5.555 3.56e-08 ***
## PositionStarting_Pitcher    -7.0825     2.3099  -3.066 0.002227 **
## PositionThird_Baseman       -4.4307     3.1719  -1.397 0.162773
## Age                          0.6904     0.2153   3.207 0.001386 **
## age30                        2.2636     1.9749   1.146 0.251992
## Height                       4.7113     0.2563  18.380  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 993 degrees of freedom
## Multiple R-squared:  0.3866, Adjusted R-squared:  0.3619
## F-statistic: 15.65 on 40 and 993 DF,  p-value: < 2.2e-16

This model performs worse than the quadratic model in terms of R-squared (0.3866 vs. 0.3889). Moreover, age30 is not significant. So, we will stick with the earlier quadratic model.


10.4.3  Model  Specification:  Adding  Interaction  Effects 

So far, only each feature's individual effect was considered in our models. It is possible that features act in pairs to affect the dependent variable. Let's examine that deeper.

Interactions are combined effects of two or more features. If we are not sure whether two features interact, we could test this by adding an interaction term into the model. If the interaction term is significant, it suggests that there may be a non-trivial interaction between the features.
fit4 <- lm(Weight ~ Team + Height + Age*Position + age2, data = mlb)
summary(fit4)

##
## Call:
## lm(formula = Weight ~ Team + Height + Age * Position + age2,
##     data = mlb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.761 -11.849  -0.761   9.911  75.533
##
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   -199.15403   29.87269  -6.667 4.35e-11 ***
## TeamARZ                          8.10376    4.26339   1.901   0.0576 .
## TeamATL                         -0.81743    3.97899  -0.205   0.8373
## TeamBAL                         -4.64820    4.03972  -1.151   0.2502
## TeamBOS                          0.37698    4.00743   0.094   0.9251
## TeamCHC                          0.33104    3.99507   0.083   0.9340
## TeamCIN                          2.56023    3.99603   0.641   0.5219
## TeamCLE                         -0.66254    4.03154  -0.164   0.8695
## TeamCOL                         -3.72098    4.03759  -0.922   0.3570
## TeamCWS                          4.63266    4.10884   1.127   0.2598
## TeamDET                          3.21380    3.98231   0.807   0.4199
## TeamFLA                          3.56432    4.14902   0.859   0.3905
## TeamHOU                         -0.38733    4.07249  -0.095   0.9242
## TeamKC                          -4.66678    4.02384  -1.160   0.2464
## TeamLA                           3.51766    4.09400   0.859   0.3904
## TeamMIN                          2.31585    4.10502   0.564   0.5728
## TeamMLW                          4.34793    4.02501   1.080   0.2803
## TeamNYM                         -0.28505    3.98537  -0.072   0.9430
## TeamNYY                          1.87847    4.12774   0.455   0.6491
## TeamOAK                         -0.23791    3.97729  -0.060   0.9523
## TeamPHI                         -6.25671    3.99545  -1.566   0.1177
## TeamPIT                          4.18719    4.01944   1.042   0.2978
## TeamSD                           2.97028    4.08838   0.727   0.4677
## TeamSEA                         -0.07220    4.05922  -0.018   0.9858
## TeamSF                           1.35981    4.07771   0.333   0.7388
## TeamSTL                         -1.23460    4.11960  -0.300   0.7645
## TeamTB                          -1.90885    4.09592  -0.466   0.6413
## TeamTEX                         -0.31570    4.03146  -0.078   0.9376
## TeamTOR                          1.73976    4.08565   0.426   0.6703
## TeamWAS                         -1.43933    4.00274  -0.360   0.7192
## Height                           4.70632    0.25646  18.351  < 2e-16 ***
## Age                              3.32733    1.37088   2.427   0.0154 *
## PositionDesignated_Hitter      -44.82216   30.68202  -1.461   0.1444
## PositionFirst_Baseman           23.51389   20.23553   1.162   0.2455
## PositionOutfielder             -13.33140   15.92500  -0.837   0.4027
## PositionRelief_Pitcher         -16.51308   15.01240  -1.100   0.2716
## PositionSecond_Baseman         -26.56932   20.18773  -1.316   0.1884
## PositionShortstop              -27.89454   20.72123  -1.346   0.1786
## PositionStarting_Pitcher        -2.44578   15.36376  -0.159   0.8736
## PositionThird_Baseman          -10.20102   23.26121  -0.439   0.6611
## age2                            -0.04201    0.02170  -1.936   0.0531 .
## Age:PositionDesignated_Hitter    1.77289    1.00506   1.764   0.0780 .
## Age:PositionFirst_Baseman       -0.71111    0.67848  -1.048   0.2949
## Age:PositionOutfielder           0.24147    0.53650   0.450   0.6527
## Age:PositionRelief_Pitcher       0.30374    0.50564   0.601   0.5482
## Age:PositionSecond_Baseman       0.46281    0.68281   0.678   0.4981
## Age:PositionShortstop            0.38257    0.70998   0.539   0.5901
## Age:PositionStarting_Pitcher    -0.17104    0.51976  -0.329   0.7422
## Age:PositionThird_Baseman        0.18968    0.79561   0.238   0.8116
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.73 on 985 degrees of freedom
## Multiple R-squared:  0.3945, Adjusted R-squared:  0.365
## F-statistic: 13.37 on 48 and 985 DF,  p-value: < 2.2e-16

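The results indicate that the overall R-squared improved (0.3945 vs. 0.3889 for the quadratic model), although the adjusted R-squared barely changed (0.365 vs. 0.3643). Only one of the interaction terms, Age:PositionDesignated_Hitter, is significant at the 0.1 level.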
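When an interaction involves a factor with several levels, it expands into several coefficients, and it can be more informative to test them jointly than to read the individual p-values. One common way is an F-test comparing the nested models with anova(). The sketch below uses simulated data; the data frame `sim`, the three position labels, and the slopes are hypothetical choices for illustration.

```r
set.seed(4)
n <- 400
sim <- data.frame(
  Age = runif(n, 21, 40),
  Position = factor(sample(c("Catcher", "Outfielder", "Shortstop"), n, replace = TRUE))
)
# Hypothetical: the age slope is steeper for one position
slope <- ifelse(sim$Position == "Outfielder", 1.5, 0.5)
sim$Weight <- 180 + slope * sim$Age + rnorm(n, sd = 8)

fit_main <- lm(Weight ~ Age + Position, data = sim)
fit_int  <- lm(Weight ~ Age * Position, data = sim)  # Age + Position + Age:Position

cmp <- anova(fit_main, fit_int)   # one F-test for all Age:Position terms together
cmp
```

Applied to the MLB models, the analogous comparison would be between the quadratic model and fit4, assuming both are stored under distinct names.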


10.5  Understanding  Regression  Trees  and  Model  Trees 


As we saw in Chap. 9, a decision tree is built from multiple if-else logical decisions and can classify observations. We can add regression to decision trees so that a decision tree can make numerical predictions.


10.5.1  Adding  Regression  to  Trees 

Numeric  prediction  trees  are  built  in  the  same  way  as  classification  trees.  Data  are 
partitioned  first  via  a  divide-and-conquer  strategy  based  on  features.  Recall  that, 
homogeneity  in  classification  trees  may  be  assessed  by  measures  like  the  entropy.  In 
prediction,  tree  homogeneity  is  measured  by  statistics  such  as  variance,  standard 
deviation  or  absolute  deviation,  from  the  mean. 

A common splitting criterion for regression trees is the standard deviation reduction (SDR).

$$SDR = sd(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times sd(T_i),$$

where sd(T) is the standard deviation of the original data, the summation runs over all n segments, $\frac{|T_i|}{|T|}$ is the proportion of observations in the i-th segment compared to the total number of observations, and sd(T_i) is the standard deviation of the i-th segment. Let's look at one simple example.


Original data:   {1, 2, 3, 3, 4, 5, 6, 6, 7, 8},
Split method 1:  {1, 2, 3 | 3, 4, 5, 6, 6, 7, 8}, and
Split method 2:  {1, 2, 3, 3, 4, 5 | 6, 6, 7, 8}.

In split method 1, T1 = {1, 2, 3}, T2 = {3, 4, 5, 6, 6, 7, 8}. In split method 2, T1 = {1, 2, 3, 3, 4, 5}, T2 = {6, 6, 7, 8}.

ori <- c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8)
at1 <- c(1, 2, 3)
at2 <- c(3, 4, 5, 6, 6, 7, 8)
bt1 <- c(1, 2, 3, 3, 4, 5)
bt2 <- c(6, 6, 7, 8)

sdr_a <- sd(ori) - (length(at1)/length(ori)*sd(at1) + length(at2)/length(ori)*sd(at2))
sdr_b <- sd(ori) - (length(bt1)/length(ori)*sd(bt1) + length(bt2)/length(ori)*sd(bt2))

sdr_a
## [1] 0.7702557

sdr_b
## [1] 1.041531

length() is used in the above R code to get the number of elements in a specific vector.

A larger SDR indicates a greater reduction in standard deviation after splitting. Here, split method 2 yields the greater SDR, so of these two candidates the regression tree would use the second split, which results in more homogeneous sets than the first.

Now, the tree will be split further under bt1 and bt2 following the same rule (greater SDR wins). When we cannot split further, bt1 and bt2 are terminal nodes. The observations classified into bt1 will be predicted with mean(bt1) = 3, and those classified as bt2 with mean(bt2) = 6.75.
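The SDR computation can be wrapped in a small helper so that any candidate split can be scored in the same way. The function name `sdr` and the scan over cut positions below are illustrative assumptions; they are a sketch of the criterion defined above, not the internal algorithm of any particular tree package.

```r
# Standard deviation reduction for a split of `parent` into a list of segments
sdr <- function(parent, segments) {
  weights <- sapply(segments, length) / length(parent)
  sd(parent) - sum(weights * sapply(segments, sd))
}

ori <- c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8)
sdr_1 <- sdr(ori, list(c(1, 2, 3), c(3, 4, 5, 6, 6, 7, 8)))   # split method 1
sdr_2 <- sdr(ori, list(c(1, 2, 3, 3, 4, 5), c(6, 6, 7, 8)))   # split method 2
c(sdr_1, sdr_2)

# Score every cut position that leaves at least two values on each side
cuts <- 2:(length(ori) - 2)
scores <- sapply(cuts, function(i) {
  sdr(ori, list(ori[1:i], ori[(i + 1):length(ori)]))
})
data.frame(cut_after = cuts, SDR = round(scores, 4))
```

The resulting table makes it possible to check whether a cut other than the two methods compared above would score higher; a tree-growing algorithm evaluates candidate cuts in this exhaustive fashion for each feature.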


10.6  Case  Study  2:  Baseball  Players  (Take  2) 

10.6.1  Step  2:  Exploring  and  Preparing  the  Data 

We  will  continue  with  the  MLB  dataset,  which  includes  1,034  observations.  Let’s  try 
to  randomly  separate  them  into  training  and  testing  datasets  first. 


set.seed(1234)
train_index <- sample(seq_len(nrow(mlb)), size = 0.75*nrow(mlb))
mlb_train <- mlb[train_index, ]
mlb_test <- mlb[-train_index, ]


We  used  a  random  75-25%  split  to  divide  the  data  into  training  and  testing  sets. 
10.6.2  Step  3:  Training  a  Model  On  the  Data 


In R, the function rpart(), in the rpart package, provides regression tree modeling:

m <- rpart(dv ~ iv, data = mydata)

•  dv: dependent variable

•  iv: independent variable

•  mydata: training data containing dv and iv.

We use two numerical features from the MLB data ("01a_data.txt"), Age and Height, as predictors.

# install.packages("rpart")
library(rpart)
mlb.rpart <- rpart(Weight ~ Height + Age, data = mlb_train)
mlb.rpart

## n= 775
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 775 323502.600 201.4361
##    2) Height< 73.5 366 112465.500 192.5000
##      4) Height< 70.5 55   9865.382 178.7818 *
##      5) Height>=70.5 311  90419.300 194.9260
##       10) Age< 31.585 234  71123.060 192.8547 *
##       11) Age>=31.585 77  15241.250 201.2208 *
##    3) Height>=73.5 409 155656.400 209.4328
##      6) Height< 76.5 335 118511.700 206.8627
##       12) Age< 28.6 194  75010.250 202.2938
##         24) Height< 74.5 76  20688.040 196.8026 *
##         25) Height>=74.5 118  50554.610 205.8305 *
##       13) Age>=28.6 141  33879.870 213.1489 *
##      7) Height>=76.5 74  24914.660 221.0676
##       14) Age< 25.37 12   3018.000 206.0000 *
##       15) Age>=25.37 62  18644.980 223.9839 *

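The output contains rich information: split indicates the decision criterion; n is the number of observations that fall in this segment; deviance is the sum of squared deviations of the response from the segment mean; and yval is the predicted value for test data that fall into this segment. An asterisk marks a terminal node.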
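The connection between the printed tree and the predictions can be verified directly: for a regression tree, the fitted value of each observation is the mean response of the terminal node it falls into. The sketch below uses the built-in mtcars data as a small stand-in (an assumption made so that the example is self-contained; it is not the MLB data).

```r
library(rpart)   # ships with standard R installations as a recommended package

# mtcars is a built-in dataset, used here only as a small stand-in
tree <- rpart(mpg ~ wt + hp, data = mtcars)

tree$frame[, c("var", "n", "dev", "yval")]    # the columns shown in the printed tree
leaf_of_row <- tree$where                     # frame row (leaf) of each observation
leaf_means  <- ave(mtcars$mpg, leaf_of_row)   # mean response within each leaf

# For a regression tree, the fitted value is the mean response of the leaf
same <- all.equal(unname(predict(tree)), leaf_means)
same
```

This is also why the predictions of a regression tree take only as many distinct values as the tree has terminal nodes.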


10.6.3  Visualizing  Decision  Trees 


A convenient way of drawing the rpart decision tree is the rpart.plot() function in the rpart.plot package (Fig. 10.13).

# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(mlb.rpart, digits = 3)

Fig. 10.13 MLB decision tree partitioning

Fig. 10.14 Expanding the decision tree by specifying significant digits, drawing separate split labels for the left and right directions, displaying the number and percentage of observations in the node, and positioning the leaf nodes at the bottom of the graph

A more detailed graph can be obtained by specifying more options in the function call (Fig. 10.14).

rpart.plot(mlb.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

We may also use a more elaborate tree plot from the package rattle to observe the order and rules of splits (Fig. 10.15).

library(rattle)
fancyRpartPlot(mlb.rpart, cex = 0.8)

Fig. 10.15 An alternative plot of the MLB decision tree


10.6.4  Step  4:  Evaluating  Model  Performance 

Let’s make predictions with the regression tree model using the predict()
command.


mlb.p <- predict(mlb.rpart, mlb_test)
summary(mlb.p)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   185.4   194.2   201.3   201.8   213.4   222.3

summary(mlb_test$Weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   150.0   190.0   200.0   202.4   215.0   260.0

Comparing the summary statistics for the predicted and true Weight, we can
see that the model cannot precisely identify extreme cases such as the maximum.
However, within the IQR, the predictions are relatively accurate. Correlation can
be used to measure the correspondence of two equal-length numeric variables. Let’s
use cor() to examine the prediction accuracy.

cor(mlb.p, mlb_test$Weight)

## [1] 0.4940257

The predicted values are moderately correlated with the true values.


10.6.5 Measuring Performance with Mean Absolute Error

To measure the distance between the predicted values and the true values, we can use a
measurement called mean absolute error (MAE), defined by the formula:

$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| pred_i - obs_i \right|,$$

where pred_i is the ith predicted value, obs_i is the ith observed value, and n is the
number of observations. Let’s make a corresponding MAE function in R and evaluate our
model performance.

MAE <- function(obs, pred) {
  mean(abs(obs - pred))
}

MAE(mlb_test$Weight, mlb.p)

##  [1]  14.97519 

This implies that, on average, the difference between the predicted value and the
observed value is 14.975. Considering that the Weight variable in our test dataset
ranges from 150 to 260, the model is reasonable.

What if we used the most primitive method for prediction - the test data
mean?

mean(mlb_test$Weight)

## [1] 202.3643

MAE(mlb_test$Weight, 202.3643)

##  [1]  17.11207 

This  shows  that  the  regression  decision  tree  is  better  than  using  the  mean  to 
predict  every  observation  in  the  test  dataset.  However,  it  is  not  dramatically  better. 
There  might  be  room  for  improvement. 
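
The comparison against the mean can be expressed as a relative improvement. The snippet below is a minimal sketch with made-up observed and predicted values (they are not the MLB results); the names mae_model, mae_baseline, and improvement are introduced here only for illustration.

```r
# Mean absolute error between observed and predicted values
MAE <- function(obs, pred) {
  mean(abs(obs - pred))
}

# Made-up observed values and model predictions (illustration only)
obs  <- c(180, 200, 220, 240)
pred <- c(190, 195, 215, 230)

mae_model    <- MAE(obs, pred)        # error of the hypothetical model
mae_baseline <- MAE(obs, mean(obs))   # error of always predicting the mean

# Relative improvement of the model over the mean baseline
improvement <- 1 - mae_model / mae_baseline
improvement
```

Applying the same ratio to the two MAE values reported above (14.975 for the tree and 17.112 for the mean) gives an improvement of roughly 12%, which quantifies "not dramatically better".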


10.6.6  Step  5:  Improving  Model  Performance 

To improve the performance of our decision tree, we are going to use a model tree
instead of a regression tree. We can use the M5P() function, in the package
RWeka, which implements the M5 algorithm. This function uses syntax similar to
rpart().

m <- M5P(dv ~ iv, data = mydata)

# install.packages("RWeka")

# Sometimes RWeka installations may be off a bit, see:
# http://stackoverflow.com/questions/41878226/using-rweka-m5p-in-rstudio-yields-java-lang-noclassdeffounderror-no-uib-cipr-ma

Sys.getenv("WEKA_HOME")  # where does it point to? Maybe some obscure path?

## [1] ""

# if yes, correct the variable:
Sys.setenv(WEKA_HOME = "C:\\MY\\PATH\\WEKA_WPM")

library(RWeka)

# WPM("list-packages", "installed")

mlb.m5 <- M5P(Weight ~ Height + Age, data = mlb_train)
mlb.m5

## M5 pruned model tree:
## (using smoothed linear models)
## LM1 (776/82.097%)
##
## LM num: 1
## Weight =
##   4.9957 * Height
##   + 1.0629 * Age
##   - 197.0898
##
## Number of Rules : 1

Instead  of  using  segment  averages  to  predict  the  player’s  weight,  this  model  uses 
a  linear  regression  (LM1)  as  the  terminal  node.  In  some  datasets  with  more  variables, 
M5P  could  yield  different  linear  models  at  each  terminal  node. 
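
The difference between segment averages and leaf-level linear models can be sketched without RWeka. The code below is only an illustration of the idea, not the M5 algorithm (it does no smoothing or pruning): it assumes simulated data, grows a one-split rpart tree, and then fits a separate lm() inside each leaf. All object names are hypothetical, and the errors are computed on the training data.

```r
library(rpart)  # recommended package distributed with R

set.seed(10)
# Simulated data (not the MLB data): y is linear in x within two regimes
x <- runif(400, 0, 10)
y <- ifelse(x < 5, 2 * x, 40 + 3 * x) + rnorm(400, sd = 0.5)
d <- data.frame(x = x, y = y)

# Regression tree with one split: each leaf predicts a constant (the leaf mean)
tree <- rpart(y ~ x, data = d, control = rpart.control(maxdepth = 1))
pred_tree <- predict(tree, d)

# Model-tree idea: replace each leaf constant by a linear model fitted in that leaf
leaf <- tree$where
pred_mt <- numeric(nrow(d))
for (k in unique(leaf)) {
  idx <- leaf == k
  fit_k <- lm(y ~ x, data = d[idx, ])
  pred_mt[idx] <- predict(fit_k, d[idx, ])
}

MAE <- function(obs, pred) mean(abs(obs - pred))
mae_tree <- MAE(d$y, pred_tree)  # constant per leaf
mae_mt   <- MAE(d$y, pred_mt)    # linear model per leaf
c(tree = mae_tree, model_tree = mae_mt)
```

When the response varies linearly within a segment, as in this simulation, the leaf-level regressions track it much more closely than a single average per leaf.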

summary(mlb.m5)

## === Summary ===
##
## Correlation coefficient                  0.571
## Mean absolute error                     13.3503
## Root mean squared error                 17.1622
## Relative absolute error                 80.3144 %
## Root relative squared error             82.0969 %
## Total Number of Instances              776

mlb.p.m5 <- predict(mlb.m5, mlb_test)
summary(mlb.p.m5)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   166.1   193.5   201.7   202.0   209.7   247.8

cor(mlb.p.m5, mlb_test$Weight)
## [1] 0.5500171
MAE(mlb_test$Weight, mlb.p.m5)
## [1] 14.07716

summary(mlb.m5) reports some rough diagnostic statistics. We can see that
the correlation and MAE for this model (0.55 and 14.08) are better than those of the
previous rpart() model (0.49 and 14.98).


10.7  Practice  Problem:  Heart  Attack  Data 


Let’s go back to the heart attack dataset (CaseStudy12_AdultsHeartAttack_Data.csv)
and practice this approach.


heart_attack <- read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)

str(heart_attack)

## 'data.frame':    150 obs. of  8 variables:
##  $ Patient  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DIAGNOSIS: int  41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
##  $ SEX      : chr  "F" "F" "F" "F" ...
##  $ DRG      : int  122 122 122 122 122 121 121 121 121 123 ...
##  $ DIED     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ CHARGES  : chr  "4752" "3941" "3657" "1481" ...
##  $ LOS      : int  10 6 5 2 1 9 15 15 2 1 ...
##  $ AGE      : int  79 34 76 80 55 84 84 70 76 65 ...


First, we need to convert CHARGES (the dependent variable) to numerical form.
NAs are created in the conversion, so let’s keep only the complete cases, as mentioned at the
beginning of this Chapter. Also, let’s create a gender variable as an indicator for
female patients using ifelse() and delete the original SEX column.


heart_attack$CHARGES <- as.numeric(heart_attack$CHARGES)

heart_attack <- heart_attack[complete.cases(heart_attack), ]
heart_attack$gender <- ifelse(heart_attack$SEX == "F", 1, 0)
heart_attack <- heart_attack[, -3]
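
The same cleaning steps can be traced on a tiny example. This is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not the heart attack data); it shows how a non-numeric entry becomes NA, how complete.cases() drops that row, and how the indicator is built.

```r
# Toy stand-in for the heart attack data (values are made up for illustration)
toy <- data.frame(
  SEX = c("F", "M", "F", "M"),
  CHARGES = c("4752", ".", "3657", "1481"),
  AGE = c(79, 34, 76, 80),
  stringsAsFactors = FALSE
)

# Non-numeric strings such as "." become NA (R also emits a coercion warning,
# which is silenced here)
toy$CHARGES <- suppressWarnings(as.numeric(toy$CHARGES))

# Keep only the rows without any missing value
toy <- toy[complete.cases(toy), ]

# Indicator for female patients, then drop the original SEX column by name
toy$gender <- ifelse(toy$SEX == "F", 1, 0)
toy$SEX <- NULL

toy
```

Dropping the column by name, as in the last step, is an alternative to the positional index -3 and does not depend on the column order.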

Next, we can build a model tree using M5P() with all the features in the model.
As usual, we need to separate the heart_attack data into training and test
datasets (e.g., use the 75-25% random split).

Using the model to predict CHARGES in the test dataset, we can obtain the
following correlation and MAE.

## [1] 0.5616003
## [1] 3193.502

We can see that the predicted values and observed values are moderately correlated.
In terms of MAE, it may seem very large at first glance.

range(ha_test$CHARGES)

## [1] 701 17137

# 17137-701

# 3193.502/16436

However, the test data itself has a wide range, and the MAE is within 20% of the
range (3193.502/16436 is about 0.19). With only 148 observations, the model represents a fairly good prediction of
the expected hospital stay charges. Try to reproduce these results and also apply the
same techniques to other data from the list of our Case-Studies.


10.8  Assignments:  10.  Forecasting  Numeric  Data  Using 
Regression  Models 

Use the Quality of Life data (Case06_QoL_Symptom_ChronicIllness) to fit several
different Multiple Linear Regression models predicting clinically relevant outcomes,
e.g., Chronic Disease Score.

• Summarize and visualize data using summary, str, pairs.panels,
ggplot.

•  Report  correlation  for  numeric  data  and  try  to  visualize  it  (e.g.,  heatmap,  pairs 
plot,  etc.) 

•  Examine  potential  dependences  of  the  predictors  and  the  dependent  response 
variables. 

• Fit a couple of Multiple Linear Regression models, report the results, and explain
the summary, residuals, effect-size coefficients, and the coefficient of determination,
R^2.

•  Draw  model  diagnostic  plots,  at  least  QQ  plot,  residuals  plot  and  leverage  plot 
(half  norm  plot). 

•  Report  results  in  terms  of  the  model. 

•  Predict  outcomes  for  new  data. 

•  Try  to  improve  the  model  performance  using  step  function  based  on  AIC 
and  BIC. 

• Fit a regression tree model and compare with OLS model.

• Try to use RWeka::M5P to improve the model.


Chapter 11

Black Box Machine-Learning Methods:
Neural Networks and Support Vector
Machines




In this Chapter, we are going to cover two very powerful machine-learning algorithms.
These techniques have complex mathematical formulations; however, efficient
algorithms and reliable software packages have been developed to utilize them
for various practical applications. We will (1) describe Neural Networks as analogues
of biological neurons; (2) develop a hands-on neural net that can be trained to
compute the square-root function; (3) describe support vector machine (SVM)
classification; and (4) complete several case-studies, including optical character
recognition (OCR), the Iris flowers, Google Trends and the Stock Market, and
Quality of Life in chronic disease.

Later,  in  Chap.  23,  we  will  provide  more  details  and  additional  examples  of  deep 
neural  network  learning.  For  now,  let’s  start  by  exploring  the  magic  inside  the 
machine  learning  black  box. 


11.1  Understanding  Neural  Networks 

11.1.1  From  Biological  to  Artificial  Neurons 

An Artificial Neural Network (ANN) model mimics the biological brain
response to multisource inputs, e.g., sensory-motor stimuli. ANNs simulate the brain
using networks of interconnected neuron cells to create massively parallel processors.
Of course, ANNs use networks of artificial nodes, not brain cells, to learn from data.

When we have three signals (or inputs) x_1, x_2, and x_3, the first step is weighting the
features (w’s) according to their importance. Then, the weighted signals are summed
by the “neuron cell” and this sum is passed on according to an activation function
denoted by f. The last step is generating an output y at the end of the process.
A typical output will have the following mathematical relationship to the inputs.

$$y(x) = f\left(\sum_{i=1}^{n} w_i x_i\right)$$


There  are  three  important  components  for  building  a  neural  network: 

•  Activation  function:  transforms  weighted  and  summed  inputs  to  an  output. 

•  Network  topology:  describes  the  number  of  “neuron  cells”,  the  number  of  layers 
and  manner  in  which  the  cells  are  connected. 

• Training algorithm: how to determine the weights w_i.

Let’s  look  at  each  of  these  components  one  by  one. 


11.1.2  Activation  Functions 


One of the functions, known as the threshold activation function, triggers an output
signal once a specified input threshold has been attained (Fig. 11.1).

$$f(v) = \begin{cases} 0 & v < 0 \\ 1 & v \geq 0 \end{cases}$$

This is the simplest form of activation function. It is rarely used in real-world
situations. The most commonly used alternative is the sigmoid activation function,
where $f(v) = \frac{1}{1 + e^{-v}}$. Here, e is Euler’s number, which is also the base of the
natural logarithm function. The output signal is no longer binary but can be any real
number between 0 and 1 (Fig. 11.2).
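
The weighted-sum neuron and the two activation functions described so far can be sketched in a few lines of R. This is a minimal illustration, assuming a threshold at 0 and made-up inputs and weights; the function names threshold, sigmoid, and neuron are introduced here and are not part of any package.

```r
# Hard threshold activation: 0 below the threshold (here 0), 1 otherwise
threshold <- function(v) ifelse(v < 0, 0, 1)

# Sigmoid activation: maps any real input into the open interval (0, 1)
sigmoid <- function(v) 1 / (1 + exp(-v))

# One artificial neuron: weighted sum of the inputs passed through f
neuron <- function(x, w, f) f(sum(w * x))

x <- c(1, 2, 3)           # three illustrative input signals
w <- c(0.5, -0.25, 0.1)   # illustrative weights (hypothetical values)

y_step    <- neuron(x, w, threshold)  # weighted sum is 0.3, so the output is 1
y_sigmoid <- neuron(x, w, sigmoid)    # sigmoid(0.3), about 0.574
c(step = y_step, sigmoid = y_sigmoid)
```

The threshold neuron only reports whether the weighted sum reached 0, whereas the sigmoid neuron also conveys how far the sum is from 0.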


Fig. 11.1 An example of a hard threshold activation function, f(x)


Fig. 11.2 The S-shaped sigmoid activation function


Fig. 11.3 Alternative types of activation functions (panels include Linear, Saturated Linear, and Hyperbolic tangent)


Other activation functions might also be useful (Fig. 11.3).

Basically, we can choose a proper activation function based on the corresponding
codomain of the function. For example, with a hyperbolic tangent activation function,
we can only have outputs ranging from -1 to 1 regardless of the input. With a linear
function, we can go from -∞ to +∞. A Gaussian activation function will give us a
model called a Radial Basis Function network (Fig. 11.3).


Fig. 11.4 A schematic of a two-layer neural network


11.1.3  Network  Topology 

Number  of  layers:  The  x’s  or  features  in  the  dataset  are  called  input  nodes  while  the 
predicted  values  are  called  the  output  nodes.  Multilayer  networks  include  multiple 
hidden  layers.  Figure  11.4  shows  a  two  layer  neural  network. 

When  we  have  multiple  layers,  the  information  flow  could  be  complicated. 


11.1.4  The  Direction  of  Information  Travel 

The  arrows  in  the  last  graph  (with  multiple  layers)  suggest  a  feed  forward  network. 
In  such  a  network,  we  can  also  have  multiple  outcomes  modeled  simultaneously 
(Fig.  11.5). 

Alternatively, in a recurrent network (feedback network), information can also
travel backwards in loops (or delay). This is illustrated in Fig. 11.6, where the short-term
memory increases the power of recurrent networks dramatically. However, in
practice, recurrent networks are rarely used.


11.1.5  The  Number  of  Nodes  in  Each  Layer 

The numbers of input nodes and output nodes are predetermined by the dataset and
the predictive variables. The number we can specify is the number of hidden nodes in
the model. To simplify the model, our goal is to use the fewest hidden
nodes for which the model performance remains reasonable.


Fig. 11.5 A multi-output neural network example


Fig.  11.6  A  schematic  of  a  delay  (feedback)  neural  network 


11.1.6  Training  Neural  Networks  with  Backpropagation 

The backpropagation algorithm determines the weights in the model using the strategy of
back-propagating errors. First, we assign random weights (but all weights must be
non-trivial). For example, we can use a normal distribution, or any other random
process, to assign initial weights. Then we adjust the weights iteratively by repeating
the process until a certain convergence or stopping criterion is met. Each iteration
contains two phases.

• Forward phase: from input layer to output layer using current weights. Outputs
are produced at the end of this phase, and

• Backward phase: compare the outputs and true target values. If the difference is
significant, we change the weights and go through the forward phase, again.

In the end, we pick a set of weights, which correspond to the least total error, to be
the final weights in our network.
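
The forward and backward phases can be illustrated with the smallest possible network. The following is a minimal sketch, assuming a single sigmoid neuron, a squared-error criterion, plain gradient descent, and a toy logical-OR dataset; the learning rate and iteration count are arbitrary choices. A full backpropagation implementation applies the same chain rule layer by layer.

```r
set.seed(1)
sigmoid <- function(v) 1 / (1 + exp(-v))

# Toy data: learn the logical OR of two binary inputs
X <- cbind(1, c(0, 0, 1, 1), c(0, 1, 0, 1))  # first column is the bias input
target <- c(0, 1, 1, 1)

w <- rnorm(3, sd = 0.5)   # random initial weights
rate <- 0.5               # learning rate (illustrative choice)
sse <- numeric(2000)      # sum of squared errors at each iteration

for (iter in seq_along(sse)) {
  # Forward phase: compute the outputs with the current weights
  out <- sigmoid(X %*% w)
  err <- out - target
  sse[iter] <- sum(err^2)

  # Backward phase: move the weights against the gradient of the squared
  # error (chain rule through the sigmoid; constant factor absorbed in rate)
  grad <- t(X) %*% (err * out * (1 - out))
  w <- w - rate * grad
}

c(first = sse[1], last = sse[length(sse)])
```

Tracking sse across iterations shows the total error shrinking as the weights are adjusted, which is the stopping signal referred to above.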


11.2  Case  Study  1:  Google  Trends  and  the  Stock  Market: 
Regression 

11.2.1  Step  1:  Collecting  Data 

In this case study, we are going to use the Google Trends and stock market dataset. A
doc file with the meta-data and the CSV data are available on the Case-Studies
Canvas Site. These daily data (between 2008 and 2009) can be used to examine the
associations between Google search trends and the daily market index - Dow Jones
Industrial Average.


Variables 

•  Index:  Time  Index  of  the  Observation 

•  Date:  Date  of  the  observation  (Format:  YYYY-MM-DD) 

•  Unemployment:  The  Google  Unemployment  Index  tracks  queries  related  to 
"unemployment,  social,  social  security,  unemployment  benefits"  and  so  on. 

• Rental: The Google Rental Index tracks queries related to “rent, apartments, for
rent, rentals,” etc.

• RealEstate: The Google Real Estate Index tracks queries related
to “real estate, mortgage, rent, apartments” and so on.

•  Mortgage:  The  Google  Mortgage  Index  tracks  queries  related  to  "mortgage, 
calculator,  mortgage  calculator,  mortgage  rates". 

•  Jobs:  The  Google  Jobs  Index  tracks  queries  related  to  "jobs,  city,  job,  resume, 
career,  monster"  and  so  forth. 

•  Investing:  The  Google  Investing  Index  tracks  queries  related  to  "stock,  finance, 
capital,  yahoo  finance,  stocks",  etc. 

•  DJI_Index:  The  Dow  Jones  Industrial  (DJI)  index.  These  data  are  interpolated 
from  5  records  per  week  (Dow  Jones  stocks  are  traded  on  week-days  only)  to 
7  days  per  week  to  match  the  constant  7-day  records  of  the  Google-Trends  data. 

• StdDJI: The standardized-DJI Index computed by: StdDJI = 3 + (DJI - 11,091)/1,501,
where m = 11,091 and s = 1,501 are the approximate mean and standard
deviation of the DJI for the period (2005-2011).

• 30-Day Moving Average Data Columns: The 8 variables below are the 30-day
moving averages of the 8 corresponding (raw) variables above.

- Unemployment30MA, Rental30MA, RealEstate30MA, Mortgage30MA,
Jobs30MA, Investing30MA, DJI_Index30MA, StdDJI_30MA.

• 180-Day Moving Average Data Columns: The 8 variables below are the
180-day moving averages of the 8 corresponding (raw) variables.

- Unemployment180MA, Rental180MA, RealEstate180MA, Mortgage180MA,
Jobs180MA, Investing180MA, DJI_Index180MA, StdDJI_180MA.
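
The StdDJI definition in the variable list can be verified with a short calculation. This is a minimal sketch; the function name std_dji is hypothetical, and the DJI values are the leading DJI_Index values that appear in the data structure printed later in this section.

```r
# Standardized DJI as defined in the variable list: 3 + (DJI - m) / s
std_dji <- function(dji, m = 11091, s = 1501) {
  3 + (dji - m) / s
}

# Leading DJI_Index values from the dataset
dji <- c(13044, 13057, 12800, 12827)
std_vals <- round(std_dji(dji), 2)
std_vals
```

The results (4.30, 4.31, 4.14, 4.16) agree with the leading StdDJI values listed for the same rows.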

Here  we  use  the  RealEstate  as  our  dependent  variable.  Let’s  see  if  the  Google 
Real  Estate  Index  could  be  predicted  by  other  variables  in  the  dataset. 


11.2.2  Step  2:  Exploring  and  Preparing  the  Data 


First, we need to load the dataset into R.

google <- read.csv("https://umich.instructure.com/files/416274/download?download_frd=1", stringsAsFactors = F)

Let’s delete the first two columns, since the only goal is to predict Google Real
Estate Index with other indexes and DJI.


google <- google[, -c(1, 2)]
str(google)

## 'data.frame':    731 obs. of  24 variables:
##  $ Unemployment      : num  1.54 1.56 1.59 1.62 1.64 1.64 1.71 1.85 1.82 1.78 ...
##  $ Rental            : num  0.88 0.9 0.92 0.92 0.94 0.96 0.99 1.02 1.02 1.01 ...
##  $ RealEstate        : num  0.79 0.81 0.82 0.82 0.83 0.84 0.86 0.89 0.89 0.89 ...
##  $ Mortgage          : num  1 1.05 1.07 1.08 1.1 1.11 1.15 1.22 1.23 1.24 ...
##  $ Jobs              : num  0.99 1.05 1.1 1.14 1.17 1.2 1.3 1.41 1.43 1.44 ...
##  $ Investing         : num  0.92 0.94 0.96 0.98 0.99 0.99 1.02 1.09 1.1 1.1 ...
##  $ DJI_Index         : num  13044 13044 13057 12800 12827 ...
##  $ StdDJI            : num  4.3 4.3 4.31 4.14 4.16 4.16 4.16 4.1 4.17 ...
##  $ Unemployment_30MA : num  1.37 1.37 1.38 1.38 1.39 1.4 1.4 1.42 1.43 1.44 ...
##  $ Rental_30MA       : num  0.72 0.72 0.73 0.73 0.74 0.75 0.76 0.77 0.78 0.79 ...
##  $ RealEstate_30MA   : num  0.67 0.67 0.68 0.68 0.68 0.69 0.7 0.7 0.71 0.72 ...
##  $ Mortgage_30MA     : num  0.98 0.97 0.97 0.97 0.98 0.98 0.98 0.99 0.99 1 ...
##  $ Jobs_30MA         : num  1.06 1.06 1.05 1.05 1.05 1.05 1.05 1.06 1.07 1.08 ...
##  $ Investing_30MA    : num  0.99 0.98 0.98 0.98 0.98 0.97 0.97 0.97 0.98 0.98 ...
##  $ DJI_Index_30MA    : num  13405 13396 13390 13368 13342 ...
##  $ StdDJI_30MA       : num  4.54 4.54 4.53 4.52 4.5 4.48 4.46 4.44 4.41 4.4 ...
##  $ Unemployment_180MA: num  1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 ...
##  $ Rental_180MA      : num  0.87 0.87 0.87 0.87 0.87 0.87 0.86 0.86 0.86 0.86 ...
##  $ RealEstate_180MA  : num  0.89 0.89 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.87 ...
##  $ Mortgage_180MA    : num  1.18 1.18 1.18 1.18 1.17 1.17 1.17 1.17 1.17 1.17 ...
##  $ Jobs_180MA        : num  1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 ...
##  $ Investing_180MA   : num  1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 ...
##  $ DJI_Index_180MA   : num  13493 13492 13489 13486 13482 ...
##  $ StdDJI_180MA      : num  4.6 4.6 4.6 4.6 4.59 4.59 4.59 4.58 4.58 4.58 ...

As we can see from the structure of the data, these indices and DJI have different
ranges. We should rescale the data. In Chap. 6, we learned that normalizing these
features using our own normalize() function provides one solution. We can use
lapply() to apply the normalize() function to each column.

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

google_norm <- as.data.frame(lapply(google, normalize))
summary(google_norm$RealEstate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.4615  0.6731  0.6292  0.8077  1.0000

It looks like all the vectors are normalized into the [0, 1] range.
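
The effect of min-max normalization is easy to see on a small example. This is a minimal sketch on a made-up data frame (toy and its values are hypothetical) whose two columns are on very different scales.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy data frame with columns on very different scales (made-up values)
toy <- data.frame(index = c(0.8, 1.0, 1.2), dji = c(12000, 13000, 12500))

# lapply() applies normalize() column by column and returns a list,
# which as.data.frame() turns back into a data frame
toy_norm <- as.data.frame(lapply(toy, normalize))
toy_norm
```

After rescaling, each column has minimum 0 and maximum 1, so no predictor dominates merely because of its units. Note that a constant column would produce a division by zero (NaN values) with this function.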

The next step would be to split the google dataset into training and testing
subsets. This time we will use the sample() and floor() functions to separate the
training and testing sets. sample() is a function that creates a set of random row
indices. We can subset the original dataset with random rows using these indices.
floor() takes a number r and returns the largest integer not greater than r.


sample(row, size)


• row: rows in the dataset that you want to select from. If you want to select all of
the rows, you can use nrow(data) or 1:nrow(data) (single number or vector).

• size: how many rows you want for your subset.

sub <- sample(nrow(google_norm), floor(nrow(google_norm) * 0.75))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]

We are good to go and can move forward to the model training phase.
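
The split can be sanity-checked on a stand-in dataset. This is a minimal sketch, assuming a made-up data frame with the same number of rows (731) as the Google data; the seed and the object names are arbitrary.

```r
set.seed(1234)  # arbitrary seed, only to make the illustration reproducible

# Stand-in data frame with the same number of rows (731) as the Google data
toy <- data.frame(id = 1:731, value = rnorm(731))

n_train <- floor(nrow(toy) * 0.75)   # floor(548.25) = 548
sub <- sample(nrow(toy), n_train)    # random row indices, without replacement

toy_train <- toy[sub, ]
toy_test  <- toy[-sub, ]

# Sizes of the two subsets
c(train = nrow(toy_train), test = nrow(toy_test))
```

Because sample() draws without replacement by default and the test set is defined by negative indexing, the two subsets are disjoint and together cover every row.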


11.2.3  Step  3:  Training  a  Model  on  the  Data 


Here, we use the function neuralnet() in the package neuralnet. neuralnet()
returns a NN object containing:

• call: the matched call.

• response: extracted from the data argument.

• covariate: the variables extracted from the data argument.

•  model.list:  a  list  containing  the  covariates  and  the  response  variables  extracted 
from  the  formula  argument. 

•  err.fct  and  act.fct:  the  error  and  activation  functions. 

•  net.result:  a  list  containing  the  overall  result  of  the  neural  network  for  every 
repetition. 

•  weights:  a  list  containing  the  fitted  weights  of  the  neural  network  for  every 
repetition. 

• result.matrix: a matrix containing the reached threshold, needed steps, error, AIC,
BIC, and weights for every repetition. Each column represents one repetition.


m <- neuralnet(target ~ predictors, data = mydata, hidden = 1),
where:


• target: variable we want to predict.

• predictors: predictors we want to use. Note that we cannot use "." to denote all the
variables in this function. We have to add all predictors one by one to the model.

• data: training dataset.

• hidden: number of hidden nodes that we want to use in the model. By default, it is
set to one.

# install.packages("neuralnet")

library(neuralnet)

google_model <- neuralnet(RealEstate ~ Unemployment + Rental + Mortgage + Jobs + Investing + DJI_Index + StdDJI, data = google_train)
plot(google_model)

Figure 11.7 shows that we have only one hidden node. Error stands for the sum
of squared errors and Steps is how many iterations the model had to go through.
These outputs could be different when you run the exact same code because the
weights are stochastically generated.


Fig. 11.7 A simple neural network predicting the real estate prices using Google market data


11.2.4  Step  4:  Evaluating  Model  Performance 

Similar to the predict() function that we have mentioned in previous Chapters,
compute() is an alternative method that helps with ANN model predictions.


p <- compute(m, test)


• m: a trained neural network model.

• test: the test dataset. This dataset should contain only the predictor columns
used in the neural network model.

In our model, we picked Unemployment, Rental, Mortgage, Jobs,
Investing, DJI_Index, StdDJI as our predictors. So, we need to find the
corresponding column numbers in the test dataset (1, 2, 4, 5, 6, 7, 8, respectively).

google_pred <- compute(google_model, google_test[, c(1:2, 4:8)])
pred_results <- google_pred$net.result
cor(pred_results, google_test$RealEstate)

##              [,1]
## [1,] 0.9653369986

As mentioned in Chap. 9, we can still use the correlation between predicted
results and observed Real Estate Index to evaluate the model. A correlation over 0.96 is
very good for real world datasets. Could this still be improved?


11.2.5 Step 5: Improving Model Performance

This time we will include four hidden nodes in the model. Let’s see what results we
can get from this more elaborate ANN model (Fig. 11.8).


google_model2 <- neuralnet(RealEstate ~ Unemployment + Rental + Mortgage + Jobs + Investing + DJI_Index + StdDJI, data = google_train, hidden = 4)

plot(google_model2)

Although the graph looks complicated, we have a smaller Error, or sum of
squared errors. Neural net models may be used for classification and regression,
which we will see in the next part. Let’s first continue with regression modeling.


Fig. 11.8 A more elaborate neural network showing decreased prediction error, compared to
Fig. 11.7 (Error: 0.337411, Steps: 1932)

google_pred2 <- compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2 <- google_pred2$net.result
cor(pred_results2, google_test$RealEstate)

##              [,1]
## [1,] 0.9869109731

We  get  an  even  higher  correlation.  This  is  almost  an  ideal  result!  The  predicted 
and  observed  indices  have  a  very  strong  linear  relationship.  Nevertheless,  too  many 
hidden  nodes  might  even  decrease  the  correlation  between  predicted  and  true  values, 
which  will  be  examined  in  the  practice  problems  later  in  this  Chapter. 


11.2.6  Step  6:  Adding  Additional  Layers 

We observe an even lower Error by using three hidden layers with 4, 3, and 3
nodes, respectively.

google_model2 <- neuralnet(RealEstate ~ Unemployment + Rental + Mortgage + Jobs + Investing + DJI_Index + StdDJI, data = google_train, hidden = c(4, 3, 3))
google_pred2 <- compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2 <- google_pred2$net.result
cor(pred_results2, google_test$RealEstate)

##              [,1]
## [1,] 0.9853727545


11.3 Simple NN Demo: Learning to Compute √

This simple example demonstrates the foundation of neural network prediction
using a basic mathematical function, the square root (Fig. 11.9).


Fig. 11.9 The square-root function evaluated at random Uniform(0,100) values

# generate random training data: 1,000 |X_i|, where X_i ~ Uniform(0, 100) or
# perhaps ~ N(0, 1)
rand_data <- abs(runif(1000, 0, 100))

# create a 2-column data frame (rand_data, sqrt_data)
sqrt_df <- data.frame(rand_data, sqrt_data = sqrt(rand_data))
plot(rand_data, sqrt_df$sqrt_data)

# Train the neural net
net.sqrt <- neuralnet(sqrt_data ~ rand_data, sqrt_df, hidden = 10, threshold = 0.01)

# report the NN
# print(net.sqrt)

# generate testing data seq(from = 0, to = N, by = 0.1)
N <- 200.0  # out of range [100: 200] is also included in the testing!
test_data <- seq(0, N, 0.1); test_data_sqrt <- sqrt(test_data)

# try to predict the square-root values using 10 hidden nodes
# Compute or predict for test data, test_data
pred_sqrt <- compute(net.sqrt, test_data)$net.result
# compute uses the trained neural net (net.sqrt),
# to estimate the square-roots of the testing data

# compare real (test_data_sqrt) and NN-predicted (pred_sqrt) square
# roots of test_data
plot(pred_sqrt, test_data_sqrt, xlim = c(0, 12), ylim = c(0, 12));
abline(0, 1, col = "red", lty = 2)
legend("bottomright", c("Pred vs. Actual SQRT", "Pred=Actual Line"),
       cex = 0.8, lty = c(1, 2), lwd = c(2, 2), col = c("black", "red"))

compare_df <- data.frame(pred_sqrt, test_data_sqrt); # compare_df
plot(test_data, test_data_sqrt)
lines(test_data, pred_sqrt, pch = 22, col = "red", lty = 2)
legend("bottomright", c("Actual SQRT", "Predicted SQRT"), lty = c(1, 2),
       lwd = c(2, 2), col = c("black", "red"))

We observe that the NN, net.sqrt, actually learns and predicts the complex
square root function with good accuracy (Figs. 11.10 and 11.11). Of course, individual
results may vary, as we randomly generate the training data (rand_data) and due to
the stochastic construction of the ANN.




Fig.  11.10  Test  data 
validation  of  the  neural 
network  predicting  the 
behavior  of  the  square-root 
function 


Fig.  11.11  Plots  illustrating 
the  agreement  of  the 
NN-predicted  and  the 
analytical  square-root 
function 




11.4 Case Study 2: Google Trends and the Stock Market - Classification


In practice, ANN models are also useful as classifiers. Let’s demonstrate this by using the Stock Market data again. We will discretize the samples into three classes according to their RealEstate values. Those above the 75th percentile are labeled 0, those below the 25th percentile are labeled 1, and all others are labeled 2. Even in the classification setting, the response still must be numeric.

google_class = google_norm
id1 = which(google_class$RealEstate > quantile(google_class$RealEstate, 0.75))
id2 = which(google_class$RealEstate < quantile(google_class$RealEstate, 0.25))
id3 = setdiff(1:nrow(google_class), union(id1, id2))
google_class$RealEstate[id1] = 0
google_class$RealEstate[id2] = 1
google_class$RealEstate[id3] = 2
summary(as.factor(google_class$RealEstate))

##   0   1   2
## 179 178 374
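
The quartile-based labeling can be tried in isolation on simulated values. The following is only a sketch: the vector v is a hypothetical stand-in for a normalized feature such as RealEstate, and the label codes follow the same convention as the code above (0 for the top quartile, 1 for the bottom quartile, 2 otherwise).

```r
set.seed(1)
v <- runif(100)                      # hypothetical stand-in for a normalized feature

# lower and upper quartile cut-offs
q <- quantile(v, c(0.25, 0.75))

# 0 = above the 75th percentile, 1 = below the 25th percentile, 2 = everything else
lab <- ifelse(v > q[2], 0, ifelse(v < q[1], 1, 2))
table(lab)
```

With distinct values, about a quarter of the observations fall into each of the two extreme classes and about half into the middle class, which matches the roughly 1:1:2 split reported by summary() above.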

Here, we divide the data into training and testing sets. We also need three additional indicator columns, one for each of the three derived RealEstate labels.

set.seed(2017)
train = sample(1:nrow(google_class), 0.7*nrow(google_class))
google_tr = google_class[train, ]
google_ts = google_class[-train, ]
train_x = google_tr[, c(1:2, 4:8)]
train_y = google_tr[, 3]
colnames(train_x)

## [1] "Unemployment" "Rental"       "Mortgage"     "Jobs"
## [5] "Investing"    "DJI_Index"    "StdDJI"

test_x = google_ts[, c(1:2, 4:8)]
test_y = google_ts[, 3]

# one indicator (dummy) column per label; columns follow the label order 0, 1, 2
train_y_ind = model.matrix(~factor(train_y)-1)
colnames(train_y_ind) = c("High", "Median", "Low")
train = cbind(train_x, train_y_ind)
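
To see what the indicator columns look like, here is a minimal sketch on a toy label vector. The vector y and the column names class0, class1, class2 are hypothetical; only model.matrix(), apply(), and which.max() are taken from the text.

```r
# toy labels standing in for the three RealEstate classes (0, 1, 2)
y <- c(0, 2, 1, 2, 0)

# one indicator column per class; "-1" removes the intercept column
y_ind <- model.matrix(~ factor(y) - 1)
colnames(y_ind) <- c("class0", "class1", "class2")
y_ind

# decoding: the position of the row maximum, shifted so labels start at 0
y_back <- apply(y_ind, 1, which.max) - 1
```

Each row contains a single 1 in the column of its class, and the decoding step recovers the original labels. The same decoding is applied to the network outputs when predictions are generated.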

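We use a non-linear output (linear.output=FALSE, so the activation function is also applied to the output neurons) and display the training progress every 2,000 iterations.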


nn_single = neuralnet(High+Median+Low ~ Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
                      data = train,
                      hidden = 4,
                      linear.output = FALSE,
                      lifesign = 'full', lifesign.step = 2000)

## hidden: 4    thresh: 0.01    rep: 1/1    steps:    2000  min thresh: 0.13702015
##                                                      4000  min thresh: 0.08524054094
##                                                      6000  min thresh: 0.08524054094
##                                                      8000  min thresh: 0.08524054094
##                                                     10000  min thresh: 0.08524054094
##                                                     40000  min thresh: 0.02427719823
##                                                     42000  min thresh: 0.02158221449
##                                                     44000  min thresh: 0.01831644589
##                                                     46000  min thresh: 0.01682874388
##                                                     48000  min thresh: 0.01572773551
##                                                     50000  min thresh: 0.01311388938
##                                                     52000  min thresh: 0.01241004281
##                                                     54000  min thresh: 0.01131407008
##                                                     55420  error: 7.01191  time: 19.3

Below  is  the  prediction  function  translating  this  model  to  generate  forecasting 
results. 


pred = function(nn, dat) {
  # compute uses the trained neural net (nn=nn_single), and
  # new testing data (dat=google_ts) to generate predictions (y_hat)
  # compute returns a list containing:
  # (1) neurons: a list of the neurons' output for each layer of the neural network, and
  # (2) net.result: a matrix containing the overall result of the neural network.
  yhat = compute(nn, dat)$net.result

  # find the maximum in each row (1) in the net.result matrix
  # to determine the first occurrence of a specific element in each row (1)
  # we can use the apply function with which.max
  yhat = apply(yhat, 1, which.max)-1
  return(yhat)
}

mean(pred(nn_single, google_ts[, c(1:2, 4:8)]) != as.factor(google_ts[, 3]))

## [1] 0.03181818182

Now  let’s  inspect  the  structure  of  the  Neural  Network. 


plot(nn_single)

Similarly, we can change hidden to utilize multiple hidden layers, e.g., hidden=c(4,5) specifies two hidden layers with 4 and 5 nodes, respectively. However, a more complicated model does not necessarily guarantee an improved performance.


nn_single = neuralnet(High+Median+Low ~ Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
                      data = train,
                      hidden = c(4, 5),
                      linear.output = FALSE,
                      lifesign = 'full', lifesign.step = 2000)

## hidden: 4, 5    thresh: 0.01    rep: 1/1    steps:    2000  min thresh: 0.307
##                                                         4000  min thresh: 0.2875517033
##                                                         6000  min thresh: 0.1383720887
##                                                         8000  min thresh: 0.1115440575
##                                                        10000  min thresh: 0.09233958192
##                                                        12000  min thresh: 0.0766173347
##                                                        14000  min thresh: 0.05763223509
##                                                        16000  min thresh: 0.03417989426
##                                                        18000  min thresh: 0.01473872843
##                                                        20000  min thresh: 0.01101646653
##                                                        20741  error: 7.00627  time: 11.

mean(pred(nn_single, google_ts[, c(1:2, 4:8)]) != as.factor(google_ts[, 3]))

## [1] 0.03181818182


11.5  Support  Vector  Machines  (SVM) 

Recall that the lazy learning methods in Chap. 6 assigned class labels using geometrical distances of different features. In multidimensional spaces (multiple features), we can use spheres with centers determined by the training dataset. Then, we can assign labels to testing data according to their nearest spherical center. Let’s see if we can choose other hypersurfaces that may separate nD data and induce a classification scheme.

Fig. 11.12 Schematic representation of linear-kernel SVM classification


11.5.1  Classification  with  Hyperplanes 

The easiest shape would be a plane. A Support Vector Machine (SVM) can use hyperplanes to separate data into several groups or classes. This approach applies to datasets that are linearly separable. Assuming that we have only two features, would you use the A or B hyperplane to separate the data in Fig. 11.12? Or perhaps even another hyperplane, C?


Finding  the  Maximum  Margin 

To answer the above question, we need to search for the Maximum Margin Hyperplane (MMH). That is the hyperplane that creates the greatest separation between the two classes, i.e., the one that lies farthest from the closest observations of either class.

We  define  support  vectors  as  the  points  from  each  class  that  are  closest  to  the 
MMH.  Each  class  must  have  at  least  one  observation  as  a  support  vector. 

Using  support  vectors  alone  is  not  sufficient  for  finding  the  MMH.  Although 
tricky  mathematical  calculations  are  involved,  the  fundamental  process  is  fairly 
simple.  Let’s  look  at  linearly  separable  data  and  non-linearly  separable  data 
individually. 


Linearly  Separable  Data 

If the dataset is linearly separable, we can find the outer boundaries of our two groups of data points. These boundaries are called convex hulls (red lines in the following graph). The MMH (black solid line) is the perpendicular bisector of the shortest line segment between the two convex hulls (Fig. 11.13).

Fig. 11.13 Convex hulls of the linearly separable groups of points


An alternative way would be picking two parallel planes that separate the data into two groups while keeping the distance between the two planes as large as possible.

To mathematically define a plane, we need to use vectors. In n-dimensional spaces, planes could be expressed by the following equation:

$$\vec{w} \cdot \vec{x} + b = 0,$$

where $\vec{w}$ (weights) and $\vec{x}$ both have n coordinates, and b is a single number known as the bias.

To clarify this notation, let’s look at the situation in a 3D space where we can express (embed) 2D Euclidean planes given a point $(x_0, y_0, z_0)$ and a normal vector $(a, b, c)$. This is just a linear equation, where $d = -(ax_0 + by_0 + cz_0)$:

$$ax + by + cz + d = 0,$$

or equivalently

$$w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0.$$

We can see that it is equivalent to the vector notation.

Using the vector notation, we can specify two hyperplanes as follows:

$$\vec{w} \cdot \vec{x} + b \geq +1$$

and

$$\vec{w} \cdot \vec{x} + b \leq -1.$$

We require that all class 1 observations fall above the first plane and all observations in the other class fall below the second plane.

The distance between two planes is calculated as:

$$\frac{2}{\|\vec{w}\|},$$

where $\|\vec{w}\|$ is the Euclidean norm. To maximize the distance, we need to minimize the Euclidean norm.
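
As a small numerical illustration of the margin formula, the sketch below uses hypothetical weights w and bias b (chosen for convenience, not estimated from any dataset in this chapter). It constructs one point on each of the two bounding planes and checks that their distance equals 2/||w||.

```r
# hypothetical hyperplane parameters (illustration only)
w <- c(3, 4)   # normal vector, ||w|| = 5
b <- -2

# Euclidean norm of w and the resulting margin width 2/||w||
w_norm <- sqrt(sum(w^2))
margin <- 2 / w_norm

# one point on each bounding plane, obtained by moving along the normal direction
x_plus  <- (1 - b) * w / w_norm^2    # satisfies w.x + b = +1
x_minus <- (-1 - b) * w / w_norm^2   # satisfies w.x + b = -1

# the two points differ only along w, so their distance is the margin width
dist_planes <- sqrt(sum((x_plus - x_minus)^2))
c(margin = margin, dist_planes = dist_planes)
```

Doubling w and b describes the same separating plane but halves the reported margin, which is why the scale is fixed by the +1 and -1 on the right-hand sides of the two inequalities.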


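To sum up, we are going to find $\min \frac{\|\vec{w}\|}{2}$ with the following constraint:

$$y_i (\vec{w} \cdot \vec{x}_i + b) \geq 1, \quad \forall \vec{x}_i,$$

where $\forall$ means “for all” and $y_i \in \{-1, +1\}$ is the class label of observation $i$.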

For each nonlinear programming problem, called the primal problem, there is a related nonlinear programming problem, called the Lagrangian dual problem. Under certain assumptions for convexity and suitable constraints, the primal and dual problems have equal optimal objective values. Primal optimization problems are typically described as:

$$\min_x f(x)$$

subject to

$$g_i(x) \leq 0$$

$$h_j(x) = 0.$$

The Lagrangian dual problem is defined as a parallel nonlinear programming problem:

$$\max_{u, v} \theta(u, v)$$

subject to $u \geq 0$, where

$$\theta(u, v) = \inf_x \left\{ f(x) + \sum_i u_i g_i(x) + \sum_j v_j h_j(x) \right\}.$$

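Chapter 21 provides additional technical details about optimization duality. Suppose the Lagrange primal is:

$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (b + x_i^T w) - 1 \right], \quad \text{where } \alpha_i \geq 0.$$

To optimize that objective function, we can set the partial derivatives equal to zero:

$$\frac{\partial}{\partial w}: \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial}{\partial b}: \quad 0 = \sum_{i=1}^{n} \alpha_i y_i.$$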



Substituting into the Lagrange primal, we obtain the Lagrange dual:

$$L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

We maximize $L_D$ subject to $\alpha_i \geq 0$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.

The Karush-Kuhn-Tucker optimization conditions suggest that we have

$$\alpha_i \left[ y_i (b + x_i^T w) - 1 \right] = 0.$$

This implies that if $y_i f(x_i) > 1$, then $\alpha_i = 0$. The support of a function ($f$) is the smallest subset of the domain containing only arguments ($x$) which are not mapped to zero ($f(x) \neq 0$). In our case, the solution $w$ is defined in terms of a linear combination of the support points:

$$f(x) = w^T x = \sum_{i=1}^{n} \alpha_i y_i x_i^T x.$$

That’s where the name of Support Vector Machines (SVM) comes from.
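
To make the dual concrete, here is a minimal sketch on a hypothetical two-point dataset (one observation per class; these are not data from this chapter). With only two points, the constraint that the sum of alpha_i y_i is zero forces alpha_1 = alpha_2, so the dual reduces to a one-dimensional maximization that base R's optimize() can handle. Realistic problems require a quadratic programming solver instead.

```r
# hypothetical toy data: one point per class
X <- rbind(c(1, 1), c(-1, -1))
y <- c(1, -1)

# matrix of y_i y_j x_i^T x_j terms appearing in the dual
G <- (y %o% y) * (X %*% t(X))

# dual objective along the feasible line alpha_1 = alpha_2 = a
LD <- function(a) {
  alpha <- c(a, a)
  sum(alpha) - 0.5 * drop(t(alpha) %*% G %*% alpha)
}

# maximize the dual over a >= 0 (the upper bound 5 is an arbitrary search limit)
opt <- optimize(LD, interval = c(0, 5), maximum = TRUE)
alpha <- rep(opt$maximum, 2)

# recover w from the stationarity condition w = sum_i alpha_i y_i x_i
w <- colSums(alpha * y * X)

# recover b from the KKT condition on a support vector: y_1 (b + x_1^T w) = 1
b <- y[1] - sum(w * X[1, ])

list(alpha = alpha, w = w, b = b, margin = 2 / sqrt(sum(w^2)))
```

Both points are support vectors here, and the resulting margin equals the distance between the two points, as expected when each class contains a single observation.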


Non-linearly  Separable  Data 

For non-linearly separable data, we need to use a small trick. We still use a plane, but allow some of the points to be misclassified into the wrong class. To penalize for that, we add a cost term after the Euclidean norm function that we need to minimize. Therefore, the solution changes to:

$$\min \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i \right)$$

$$\text{s.t. } y_i (\vec{w} \cdot \vec{x}_i + b) \geq 1 - \xi_i, \quad \forall \vec{x}_i, \quad \xi_i \geq 0,$$

where $C$ is the cost and $\xi_i$ is the distance between the misclassified observation $i$ and the plane.
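
The role of the slack terms can be illustrated with a short sketch. The plane (w, b), the cost C, and the four toy observations below are hypothetical values chosen for illustration. For a fixed plane, the smallest slack satisfying the constraint is xi_i = max(0, 1 - y_i (w . x_i + b)).

```r
# hypothetical plane, cost, and toy data (illustration only)
w <- c(1, 1); b <- -3; C <- 1
X <- rbind(c(3, 2),    # class +1, outside the margin
           c(2, 1.5),  # class +1, inside the margin
           c(1, 1),    # class -1, on the margin boundary
           c(2, 2))    # class -1, on the wrong side of the plane
y <- c(1, 1, -1, -1)

# functional margins y_i (w . x_i + b)
fm <- y * (drop(X %*% w) + b)

# slack is zero whenever the constraint y_i (w . x_i + b) >= 1 already holds
xi <- pmax(0, 1 - fm)

# penalized objective ||w||^2 / 2 + C * sum(xi)
objective <- sum(w^2) / 2 + C * sum(xi)
data.frame(fm = fm, xi = xi)
```

Only the second and fourth observations contribute to the penalty. Increasing C makes such violations more expensive and pushes the solution toward a narrower margin with fewer violations.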

We have the Lagrange primal function:

$$L_P = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (b + x_i^T w) - (1 - \xi_i) \right] - \sum_{i=1}^{n} \gamma_i \xi_i,$$

where $\alpha_i, \gamma_i \geq 0$.

Similar to what we did above for the separable case, we can use the derivatives of the primal problem to solve the dual problem.

Notice the inner product $x_i^T x_j$ in the final (dual) expression. We can replace this inner product with a kernel function that maps the feature space into a higher dimensional space (e.g., using a polynomial kernel) or an infinite dimensional space (e.g., using a Gaussian kernel).


Using  Kernels  for  Non-linear  Spaces 

An alternative way to handle non-linearly separable data is called the kernel trick. The idea is to add some dimensions (or features) so that data that are not linearly separable become separable in a higher dimensional space.

How can we do that? We transform our data using kernel functions. A general form for kernel functions would be:

$$K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j).$$

Here $\phi$ is a mapping of the data into another space, and the kernel is the inner product in that space.

The linear kernel would be the simplest one that is just the dot product of the features.

$$K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j.$$

The polynomial kernel of degree $d$ transforms the data by adding a simple non-linear transformation of the data.

$$K(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d.$$

The sigmoid kernel is very similar to a neural network. It uses a sigmoid activation function.

$$K(\vec{x}_i, \vec{x}_j) = \tanh(\kappa \, \vec{x}_i \cdot \vec{x}_j - \delta).$$

The Gaussian RBF kernel is similar to an RBF neural network and is a good place to start investigating a dataset.

$$K(\vec{x}_i, \vec{x}_j) = \exp \left( \frac{-\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2} \right).$$
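
The kernels listed above can be written as plain R functions. This is only a sketch: the function names, the helper gram(), and the default parameter values (d, kappa, delta, sigma) are assumptions made for illustration. The Gaussian kernel is written with the common 2 sigma^2 scaling, whereas kernlab's rbfdot parametrizes it as exp(-sigma ||x - x'||^2), so the two sigma values are not interchangeable.

```r
# kernel functions for two numeric vectors (illustrative default parameters)
k_linear  <- function(x, z) sum(x * z)
k_poly    <- function(x, z, d = 2) (sum(x * z) + 1)^d
k_sigmoid <- function(x, z, kappa = 1, delta = 0) tanh(kappa * sum(x * z) - delta)
k_rbf     <- function(x, z, sigma = 1) exp(-sum((x - z)^2) / (2 * sigma^2))

# Gram (kernel) matrix for the rows of a data matrix X
gram <- function(X, k, ...) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- k(X[i, ], X[j, ], ...)
  K
}

# three toy observations with two features each
X <- rbind(c(0, 0), c(1, 0), c(0, 2))
K_lin  <- gram(X, k_linear)
K_poly <- gram(X, k_poly, d = 2)
K_rbf  <- gram(X, k_rbf, sigma = 1)
```

The SVM dual only needs the entries of such a Gram matrix, so swapping the kernel changes the geometry of the classifier without changing the optimization procedure.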


11.6  Case  Study  3:  Optical  Character  Recognition  (OCR) 

This  example  illustrates  interpreting,  processing  and  recognizing  handwritten  notes 
(text).  Specifically,  we  will  convert  handwritten  characters  (unstructured  image  data) 
to  printed  text  (typeset  characters). 

Fig. 11.14 Example of the preprocessed gridded handwritten letters


Protocol: 

•  Divide  the  image  (typically  optical  image  of  handwritten  notes  on  paper)  into  a 
fine  grid  where  each  cell  contains  one  glyph  (symbol,  letter,  number). 

•  Match  the  glyph  in  each  cell  to  one  of  the  possible  characters  in  a  dictionary. 

•  Combine  individual  characters  together  into  words  to  reconstitute  the  digital 
representation  of  the  optical  image  of  the  handwritten  notes. 

In  this  example,  we  use  an  optical  document  image  (data)  that  has  already  been 
pre-partitioned  into  rectangular  grid  cells  containing  one  character  of  the  26  English 
letters,  A  through  Z. 

The  resulting  gridded  dataset  is  distributed  by  the  UCI  Machine  Learning  Data 
Repository.  The  dataset  contains  20,000  examples  of  26  English  capital  letters 
printed  using  20  different  randomly  reshaped  and  morphed  fonts  (Fig.  11.14). 


11.6.1  Step  1:  Prepare  and  Explore  the  Data 

# read in data and examine structure
# (stringsAsFactors = TRUE keeps the letter column a factor in R >= 4.0)
hand_letters <- read.csv("https://umich.instructure.com/files/2837863/download?download_frd=1",
                         header = T, stringsAsFactors = TRUE)
str(hand_letters)

## 'data.frame':    20000 obs. of  17 variables:
##  $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
##  $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
##  $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
##  $ width : int  3 3 6 6 3 5 5 3 4 13 ...
##  $ height: int  5 7 8 6 1 8 4 2 4 9 ...
##  $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
##  $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
##  $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
##  $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
##  $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
##  $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
##  $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
##  $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
##  $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
##  $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
##  $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
##  $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...

# divide into training (3/4) and testing (1/4) data
hand_letters_train <- hand_letters[1:15000, ]
hand_letters_test <- hand_letters[15001:20000, ]


11.6.2  Step  2:  Training  an  SVM  Model 

We can specify vanilladot as a linear kernel, or alternatively:

• rbfdot Radial Basis kernel i.e., “Gaussian”

• polydot Polynomial kernel

• tanhdot Hyperbolic tangent kernel

• laplacedot Laplacian kernel

• besseldot Bessel kernel

• anovadot ANOVA RBF kernel

• splinedot Spline kernel

• stringdot String kernel


# begin by training a simple linear SVM
library(kernlab)
set.seed(123)
hand_letter_classifier <- ksvm(letter ~ ., data = hand_letters_train,
                               kernel = "vanilladot")

## Setting default kernel parameters

# look at basic information about the model
hand_letter_classifier

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc  (classification)
##  parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 6618
##
## Objective Function Value : -13.2947 -19.6051 -20.8982 -5.6651 -7.2092 -31
.5151  -48.3253  -17.6236  -57.0476  -30.532  -15.7162  -31.49  -28.2706  -45.741  -1 
1.7891  -33.3161  -28.2251  -16.5347  -13.2693  -30.88  -29.4259  -7.7099  -11.1685 
-29.4289  -13.0857  -9.2631  -144.1105  -52.7747  -71.052  -109.7783  -158.3152  -51 
.2839  -39.6499  -67.0061  -23.8637  -27.6083  -26.3461  -35.2626  -38.6346  -116.89 
67  -173.8336  -214.2196  -20.7925  -10.3812  -53.1156  -12.228  -46.6132  -8.6867  - 
18.9108  -11.0535  -94.5751  -26.5689  -224.0215  -70.5714  -8.3232  -4.5265  -132.5 
431  -74.6876  -19.5742  -12.7352  -81.7894  -11.6983  -25.4835  -17.582  -23.934  -2 
7.022  -50.7092  -10.9228  -4.3852  -13.7216  -3.8547  -3.5723  -8.419  -36.9773  -47 
.1418  -172.6874  -42.457  -44.0342  -42.7695  -13.0527  -16.7534  -78.7849  -101.81 
46  -32.1141  -30.3349  -104.0695  -32.1258  -24.6301  -32.6087  -17.0808  -5.1347  - 
40.5505  -6.684  -16.2962  -56.364  -147.3669  -49.0907  -37.8334  -32.8068  -73.248 
-127.7819  -10.5342  -5.2495  -11.9568  -30.1631  -135.5915  -51.521  -176.2669  -99 
.0973  -10.295  -14.5906  -3.7822  -64.1452  -7.4813  -84.9109  -40.9146  -87.2437  - 
66.8629  -69.9932  -20.5294  -12.7577  -7.0328  -22.9219  -12.3975  -223.9411  -29.9 
969  -24.0552  -132.6252  -133.7033  -9.2959  -33.1873  -5.8016  -57.3392  -60.9046 
-27.1766  -200.8554  -29.9334  -15.9359  -130.0183  -154.4587  -43.5779  -24.4852  - 
135.7896  -74.1531  -303.5043  -131.4741  -149.5403  -30.4917  -29.8086  -47.3454  - 
24.6204  -44.2792  -6.2064  -8.6708  -36.4412  -68.712  -179.7303  -44.7489  -84.860 
8  -136.6786  -569.3398  -113.0779  -138.435  -303.8556  -32.8011  -60.4546  -139.35 
25  -108.9841  -34.277  -64.9071  -38.6148  -7.5086  -204.222  -12.9572  -29.0252  -2 
.0352  -5.9916  -14.3706  -21.5773  -57.0064  -19.6546  -178.0543  -19.812  -4.145  - 
4.5318  -0.8101  -116.8649  -7.8269  -53.3445  -21.4812  -13.5066  -5.3881  -15.1061 
-27.6061  -18.9239  -68.8104  -26.1223  -93.0231  -15.1693  -9.7999  -7.6137  -1.530 
1  -84.9531  -5.4551  -93.187  -93.4153  -43.8334  -23.6706  -59.1468  -22.0933  -47. 
8381  -219.9936  -39.5596  -47.2643  -34.0752  -20.2532  -11.239  -118.4152  -6.4126 
-5.1846  -8.7272  -9.4584  -20.8522  -22.0878  -113.0806  -29.0912  -80.397  -29.620 
6  -13.7422  -8.9416  -3.0785  -79.842  -6.1869  -13.9663  -63.3665  -93.2067  -11.55 
93  -13.0449  -48.2558  -2.9343  -8.25  -76.4361  -33.5374  -109.112  -4.1731  -6.197 
8  -1.2664  -84.1287  -18.3054  -7.2209  -45.5509  -3.3567  -16.8612  -60.5094  -43.9 
956  -53.0592  -6.1407  -17.4499  -2.3741  -65.023  -102.1593  -103.4312  -23.1318  - 
17.3394  -50.6654  -31.4407  -57.6065  -19.6857  -5.2667  -4.1767  -55.8445  -30.92 
-57.2396  -30.1101  -7.611  -47.7711  -12.1616  -19.1572  -53.5364  -3.8024  -53.124 
-225.6075  -12.6791  -11.5852  -16.6614  -9.7186  -65.824  -16.3897  -42.3931  -50.5 
13  -24.752  -14.513  -40.495  -16.5124  -57.1813  -4.7974  -5.2949  -81.7477  -3.272 
-6.3448  -1.1259  -114.3256  -22.3232  -339.8619  -31.0491  -31.3872  -4.9625  -82.4 
936  -123.6225  -72.8463  -23.4836  -33.1608  -11.7133  -19.7607  -1.8599  -50.1148 
-8.2868  -143.3592  -1.8508  -1.9699  -9.4175  -0.5202  -25.0654  -30.0489  -5.6248 
##  Training  error  :  0.129733 


11.6.3  Step  3:  Evaluating  Model  Performance 

Let’s  assess  the  SVM  prediction  using  the  testing  data. 


# predictions on testing dataset
hand_letter_predictions <- predict(hand_letter_classifier, hand_letters_test)
head(hand_letter_predictions)

## [1] C U K U E I
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
table(hand_letter_predictions, hand_letters_test$letter)

## [26 x 26 cross-tabulation of hand_letter_predictions (rows) against hand_letters_test$letter (columns), letters A through Z; correctly classified letters are counted on the diagonal]


# look only at agreement vs. non-agreement
# construct a vector of TRUE/FALSE indicating correct/incorrect predictions
agreement <- hand_letter_predictions == hand_letters_test$letter

# check if characters agree
table(agreement)

## agreement
## FALSE  TRUE
##   780  4220

prop.table(table(agreement))

## agreement
## FALSE  TRUE
## 0.156 0.844
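
Beyond the overall agreement rate, the cross-tabulation also yields per-class results. The sketch below uses small hypothetical label vectors (not the OCR data) to show how the overall accuracy and the per-class proportion of correctly recognized items can be read off a confusion matrix.

```r
# hypothetical true and predicted labels (illustration only)
lv <- c("A", "B", "C")
actual    <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = lv)
predicted <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = lv)

# rows: predictions, columns: true labels (same layout as the table above)
cm <- table(predicted, actual)

# overall agreement: diagonal counts over all counts
accuracy <- sum(diag(cm)) / sum(cm)

# per-class share of correctly recognized items: diagonal over column totals
recall <- diag(cm) / colSums(cm)
```

For the OCR model, the same two lines applied to table(hand_letter_predictions, hand_letters_test$letter) would show which letters are most often confused, information that the single agreement proportion hides.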


11.6.4  Step  4:  Improving  Model  Performance 

Replacing  the  vanilladot  linear  kernel  with  rbfdot  Radial  Basis  Function 
kernel,  i.e.,  “Gaussian”  kernel,  may  improve  the  OCR  prediction. 

hand_letter_classifier_rbf <- ksvm(letter ~ ., data = hand_letters_train,
                                   kernel = "rbfdot")
hand_letter_predictions_rbf <- predict(hand_letter_classifier_rbf,
                                       hand_letters_test)
agreement_rbf <- hand_letter_predictions_rbf == hand_letters_test$letter
table(agreement_rbf)

## agreement_rbf
## FALSE  TRUE
##   360  4640

prop.table(table(agreement_rbf))

## agreement_rbf
## FALSE  TRUE
## 0.072 0.928

Note the improvement of the automated (SVM) classification accuracy (0.928) for rbfdot compared to the previous (vanilladot) result (0.844).


11.7  Case  Study  4:  Iris  Flowers 


Let’s  have  another  look  at  the  iris  data  that  we  saw  in  Chap.  2. 


11.7.1  Step  1:  Collecting  Data 

SVM requires all features to be numeric, and each feature has to be scaled into a relatively small interval. We are using Edgar Anderson’s Iris Data in R for this case study. This dataset measures the length and width of sepals and petals from three Iris flower species.


11.7.2  Step  2:  Exploring  and  Preparing  the  Data 

Let’s  load  the  data  first.  In  this  case  study  we  want  to  explore  the  variable  Species. 


data(iris)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

table(iris$Species)

##
##     setosa versicolor  virginica
##         50         50         50

The data look good but we still can normalize the features either by hand or using an R function.
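
Both options can be sketched in a few lines. The helper name min_max and the object names iris_mm and iris_z are hypothetical; scale() is the base R function for standardization.

```r
data(iris)
features <- iris[, 1:4]

# by hand: min-max rescaling of each column to the interval [0, 1]
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
iris_mm <- as.data.frame(lapply(features, min_max))

# using an R function: scale() centers each column at 0 and rescales to SD 1
iris_z <- as.data.frame(scale(features))

summary(iris_mm$Petal.Length)
```

Note that ksvm() (argument scaled) and e1071's svm() (argument scale) standardize the inputs by default, so explicit normalization mainly matters when those defaults are switched off.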

Next, we can separate the training and testing datasets using the 75%-25% rule.

sub <- sample(nrow(iris), floor(nrow(iris)*0.75))
iris_train <- iris[sub, ]
iris_test <- iris[-sub, ]

We  can  try  the  linear  and  non-linear  kernels  on  the  iris  data  (Figs.  11.15  and 
11.16). 

require(e1071)
iris.svm_1 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
                  kernel="linear", cost=1)
iris.svm_2 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
                  kernel="radial", cost=1)
par(mfrow=c(2,1))
plot(iris.svm_1, iris[,c(5,3,4)]); legend("center", "Linear")
plot(iris.svm_2, iris[,c(5,3,4)]); legend("center", "Radial")

Fig. 11.15 Linear-SVM kernel classification of the iris flowers dataset

Fig. 11.16 Radial-SVM kernel classification of the iris flowers dataset


11.7.3  Step  3:  Training  a  Model  on  the  Data 

We are going to use kernlab for this case study. However, other packages like e1071 and klaR are available if you are quite familiar with C++.

Let’s break down the function ksvm():

m <- ksvm(target ~ predictors, data = mydata, kernel = "rbfdot", C = 1)

• target: the outcome variable that we want to predict.

• predictors: the features that the prediction is based on. In this function we can again use the “.” to represent all the variables in the dataset.

• data: the training dataset in which the target and predictors can be found.

• kernel: the kernel mapping we want to use. By default, it is the radial basis function (rbfdot).

• C: a number that specifies the cost of misclassification.

Let’s install the package and play with the data now.

# install.packages("kernlab")
library(kernlab)
iris_clas <- ksvm(Species~., data=iris_train, kernel="vanilladot")

## Setting default kernel parameters
iris_clas

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc  (classification)
##  parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 24
##
## Objective Function Value : -1.0066 -0.3309 -13.8658
## Training error : 0.026786

Here, we used all the variables other than Species in the dataset as predictors. We also used the kernel vanilladot, which is the linear kernel in this model. We get a training error of about 0.027.


11.7.4  Step  4:  Evaluating  Model  Performance 

Again, the predict() function is used to forecast the species for the test data. Here, we have a factor outcome, so we need the command table() to show us how well the predictions and the actual data match.

iris.pred <- predict(iris_clas, iris_test)
table(iris.pred, iris_test$Species)

##
## iris.pred    setosa versicolor virginica
##   setosa         13          0         0
##   versicolor      0         14         0
##   virginica       0          1        10

We can see a single case of Iris versicolor misclassified as Iris virginica (rows are predictions, columns are the actual species). The taxa of all other flowers are correctly predicted.

To  see  the  results  more  clearly,  we  can  use  the  proportional  table  to  show  the 
agreements  of  the  categories. 

agreement <- iris.pred==iris_test$Species
prop.table(table(agreement))

## agreement
##         FALSE          TRUE
## 0.02631578947 0.97368421053

Here == means “equal to”. Over 97% of predictions are correct. Nevertheless, is there any chance that we can improve the outcome? What if we try a Gaussian kernel?


11.7.5  Step  5:  RBF  Kernel  Function 

The linear kernel is the simplest one but usually not the best one. Let’s try the RBF (Radial Basis “Gaussian” Function) kernel instead.

iris_clas1 <- ksvm(Species~., data=iris_train, kernel="rbfdot")
iris_clas1

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc  (classification)
##  parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
##  Hyperparameter : sigma =  0.877982617394805
##
## Number of Support Vectors : 52
##
## Objective Function Value : -4.6939 -5.1534 -16.2297
## Training error : 0.017857

iris.pred1 <- predict(iris_clas1, iris_test)
table(iris.pred1, iris_test$Species)

##
## iris.pred1   setosa versicolor virginica
##   setosa         13          0         0
##   versicolor      0         14         2
##   virginica       0          1         8

agreement <- iris.pred1==iris_test$Species
prop.table(table(agreement))

## agreement
##         FALSE          TRUE
## 0.07894736842 0.92105263158

Unfortunately, the model performance is actually worse than the previous one
(you might get slightly different results). A likely reason is that the iris
classes are close to linearly separable in these features, so the extra
flexibility of the RBF kernel does not help here. In practice, we could try some
alternative kernel functions and see which one fits the dataset the best.
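To make the Gaussian kernel more concrete, here is a minimal sketch of the RBF similarity between two observations, using the parameterization k(x, y) = exp(-sigma * ||x - y||^2) that kernlab documents for rbfdot. The function name rbf_kernel and the rounded sigma value are illustrative assumptions; note also that ksvm scales the features by default, so these raw-feature values are not the ones used inside the fitted model.

```r
# Gaussian (RBF) kernel: similarity decays with the squared Euclidean distance
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

sigma <- 0.878                  # rounded from the model summary above
a <- c(5.1, 3.5, 1.4, 0.2)      # a setosa-like flower (raw measurements)
b <- c(7.0, 3.2, 4.7, 1.4)      # a versicolor-like flower (raw measurements)

k_aa <- rbf_kernel(a, a, sigma) # identical points: similarity of exactly 1
k_ab <- rbf_kernel(a, b, sigma) # distant points: similarity close to 0

print(c(k_aa, k_ab))
```

Larger sigma values make the similarity decay faster, producing more local (and potentially more overfitted) decision boundaries.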


11.7.6  Parameter  Tuning 


We can tune the SVM using the tune.svm function in the package e1071
(Fig. 11.17).

Fig. 11.17 Training data, cross-validation, and testing data SVM classification
errors of the iris flowers (horizontal axis: Cost; curves: Train, CV, Test)

costs = exp(-5:8)
tune.svm(Species ~ ., kernel = "radial", data = iris_train, cost = costs)

##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##     1
##
## - best performance: 0.03636363636
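The candidate costs above form a grid that is evenly spaced on the log scale. A small sketch that inspects this grid (nothing beyond base R is assumed):

```r
# Candidate costs: e^-5, e^-4, ..., e^8 (14 values, one natural-log unit apart)
costs <- exp(-5:8)

n_costs <- length(costs)       # number of candidate values
cost_range <- range(costs)     # from about 0.0067 to about 2981
default_pos <- which(costs == 1)  # position of the default cost C = 1 = e^0

print(n_costs)
print(signif(cost_range, 4))
print(default_pos)
```

A logarithmic grid covers several orders of magnitude with few candidates, which is useful because the SVM error typically changes slowly with the cost.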


Further, we can draw a Cross-Validation (CV) plot to gauge the model
performance; see cross-validation details in Chap. 21:


set.seed(2017)
require(sparsediscrim); require(reshape); require(ggplot2)

# 5-fold partition of the full iris data (each fold has $training and $test)
folds = cv_partition(iris$Species, num_folds = 5)

# For a given cost, return the training, CV, and test classification errors
train_cv_error_svm = function(costC) {
  # Train
  ir.svm = svm(Species ~ ., data = iris,
               kernel = "radial", cost = costC)
  train_error = sum(ir.svm$fitted != iris$Species) / nrow(iris)
  # Test
  test_error = sum(predict(ir.svm, iris_test) != iris_test$Species) / nrow(iris_test)
  # CV error
  ire.cverr = sapply(folds, function(fold) {
    svmcv = svm(Species ~ ., data = iris, kernel = "radial", cost = costC,
                subset = fold$training)
    svmpred = predict(svmcv, iris[fold$test, ])
    return(sum(svmpred != iris$Species[fold$test]) / length(fold$test))
  })
  cv_error = mean(ire.cverr)
  return(c(train_error, cv_error, test_error))
}

costs = exp(-5:8)
ir_cost_errors = sapply(costs, function(cost) train_cv_error_svm(cost))
df_errs = data.frame(t(ir_cost_errors), costs)
colnames(df_errs) = c('Train', 'CV', 'Test', 'Logcost')
dataL <- melt(df_errs, id = "Logcost")

ggplot(dataL, aes_string(x = "Logcost", y = "value", colour = "variable",
                         group = "variable", linetype = "variable", shape = "variable")) +
  geom_line(size = 1) + labs(x = "Cost",
                             y = "Classification error", colour = "", group = "",
                             linetype = "", shape = "") + scale_x_log10()
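If the sparsediscrim package is not available, the fold structure used above can be approximated in base R. The sketch below is an assumption-laden stand-in: make_folds is a hypothetical helper (not a sparsediscrim function), and it draws simple random folds without any stratification by class.

```r
# Build k folds of row indices; each fold holds $training and $test indices,
# mirroring the structure used by the CV code above.
make_folds <- function(n, k) {
  shuffled <- sample(n)                                   # random row order
  test_sets <- split(shuffled, rep(seq_len(k), length.out = n))
  lapply(test_sets, function(test) {
    list(training = setdiff(seq_len(n), test), test = test)
  })
}

set.seed(2017)
folds_manual <- make_folds(nrow(iris), 5)

# Every row appears in exactly one test set
all_test <- sort(unlist(lapply(folds_manual, function(f) f$test)))
print(identical(as.integer(all_test), seq_len(nrow(iris))))
```

The resulting list can be passed to the same sapply(folds, ...) loop, since only fold$training and fold$test are accessed there.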


11.7.7  Improving  the  Performance  of  Gaussian  Kernels 

Now,  let’s  attempt  to  improve  the  performance  of  a  Gaussian  kernel  by  tuning: 

set.seed(2020)
gammas = exp(-5:5)
tune_g = tune.svm(Species ~ ., kernel = "radial", data = iris_train,
                  cost = costs, gamma = gammas)
tune_g

## Parameter tuning of 'svm':
## - sampling method: 10-fold cross validation
## - best parameters:
##          gamma        cost
##  0.01831563889 20.08553692
## - best performance: 0.025

With the tuned parameters, the Gaussian-kernel model predicts better than the
untuned one and matches the test accuracy of the linear kernel:

iris.svm_g <- svm(Species ~ ., data = iris_train,
                  kernel = "radial", gamma = 0.0183, cost = 20)
table(iris_test$Species, predict(iris.svm_g, iris_test))

##
##              setosa versicolor virginica
##   setosa         13          0         0
##   versicolor      0         14         1
##   virginica       0          0        10

agreement <- predict(iris.svm_g, iris_test) == iris_test$Species
prop.table(table(agreement))

## agreement
##         FALSE          TRUE
## 0.02631578947 0.97368421053
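The joint tuning above evaluates every combination of the candidate gamma and cost values. A minimal sketch of that search grid, using only base R (the object names are illustrative), shows how many models are compared and that the reported optimum sits on the grid at gamma = exp(-4) and cost = exp(3):

```r
costs  <- exp(-5:8)   # 14 candidate costs
gammas <- exp(-5:5)   # 11 candidate gammas

# Every (gamma, cost) pair that a full grid search would evaluate
grid <- expand.grid(gamma = gammas, cost = costs)
n_models <- nrow(grid)

# The best parameters reported above correspond to grid points e^-4 and e^3
best_gamma <- exp(-4)
best_cost  <- exp(3)

print(n_models)
print(c(best_gamma, best_cost))
```

With 10-fold cross-validation, each of these grid points requires ten model fits, so the search cost grows quickly with the grid resolution.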

Chapter 23 provides more details about prediction and classification using neural
networks and deep learning.


11.8 Practice

11.8.1 Problem 1: Google Trends and the Stock Market

Use the Google Trends data. Fit a neural network model with the same training
data as in Case Study 1. This time, use Investing as the target and
Unemployment, Rental, RealEstate, Mortgage, Jobs, DJI_Index, and StdDJI as
predictors. Use three hidden nodes. Note: remember to change the columns you
want to include in the test dataset when predicting.

The following number is the correlation between predicted and observed values.

##              [,1]
## [1,] 0.8845711444

You might get slightly different results since the weights are generated
randomly.


11.8.2  Problem  2:  Quality  of  Life  and  Chronic  Disease 


Use the same data as in Chap. 8: Quality of life and chronic disease (dataset
and meta-data doc).

Let’s load the data first. In this case study, we want to use the variable
CHARLSONSCORE as our target variable.

qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)


## 'data.frame':    2356 obs. of  41 variables:
##  $ ID                 : int  171 171 172 179 180 180 181 182 183 186 ...
##  $ INTERVIEWDATE      : int  0 427 0 0 0 42 0 0 0 0 ...
##  $ LANGUAGE           : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ AGE                : int  49 49 62 44 64 64 52 48 49 78 ...
##  $ RACE_ETHNICITY     : int  3 3 3 7 3 3 3 3 3 4 ...
##  $ SEX                : int  2 2 2 2 1 1 2 1 1 1 ...
##  $ QOL_Q_01           : int  4 4 3 6 3 3 4 2 3 5 ...
##  $ QOL_Q_02           : int  4 3 3 6 2 5 4 1 4 6 ...
##  $ QOL_Q_03           : int  4 4 4 6 3 6 4 3 4 4 ...
##  $ QOL_Q_04           : int  4 4 2 6 3 6 2 2 5 2 ...
##  $ QOL_Q_05           : int  1 5 4 6 2 6 4 3 4 3 ...
##  $ QOL_Q_06           : int  4 4 2 6 1 2 4 1 2 4 ...
##  $ QOL_Q_07           : int  1 2 5 -1 0 5 8 4 3 7 ...
##  $ QOL_Q_08           : int  6 1 3 6 6 6 3 1 2 4 ...
##  $ QOL_Q_09           : int  3 4 3 6 2 2 4 2 2 4 ...
##  $ QOL_Q_10           : int  3 1 3 6 3 6 3 2 4 3 ...
##  $ MSA_Q_01           : int  1 3 2 6 2 3 4 1 1 2 ...
##  $ MSA_Q_02           : int  1 1 2 6 1 6 4 3 2 4 ...
##  $ MSA_Q_03           : int  2 1 2 6 1 2 3 3 1 2 ...
##  $ MSA_Q_04           : int  1 3 2 6 1 2 1 4 1 5 ...
##  $ MSA_Q_05           : int  1 1 1 6 1 2 1 6 3 2 ...
##  $ MSA_Q_06           : int  1 2 2 6 1 2 1 1 2 2 ...
##  $ MSA_Q_07           : int  2 1 3 6 1 1 1 1 1 5 ...
##  $ MSA_Q_08           : int  1 1 1 6 1 1 1 1 2 1 ...
##  $ MSA_Q_09           : int  1 1 1 6 2 2 4 6 2 1 ...
##  $ MSA_Q_10           : int  1 1 1 6 1 1 1 1 1 3 ...
##  $ MSA_Q_11           : int  2 3 2 6 1 1 2 1 1 5 ...
##  $ MSA_Q_12           : int  1 1 2 6 1 1 2 6 1 3 ...
##  $ MSA_Q_13           : int  1 1 1 6 1 6 2 1 4 2 ...
##  $ MSA_Q_14           : int  1 1 1 6 1 2 1 1 3 1 ...
##  $ MSA_Q_15           : int  2 1 1 6 1 1 3 2 1 3 ...
##  $ MSA_Q_16           : int  2 3 5 6 1 2 1 2 1 2 ...
##  $ MSA_Q_17           : int  2 1 1 6 1 1 1 1 1 3 ...
##  $ PH2_Q_01           : int  3 2 1 5 1 1 3 1 2 3 ...
##  $ PH2_Q_02           : int  4 4 1 5 1 2 1 1 4 2 ...
##  $ TOS_Q_01           : int  2 2 2 4 1 1 2 2 1 1 ...
##  $ TOS_Q_02           : int  1 1 1 4 4 4 1 2 4 4 ...
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : int  2 2 3 1 0 0 2 8 0 1 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...


Remove the first two columns (we don’t need the ID variables) and the rows that
have missing CHARLSONSCORE values (coded as -9). Note that
!qol$CHARLSONSCORE == -9 selects only the rows that have CHARLSONSCORE not
equal to -9. The exclamation sign is logical negation (here, “exclude”), so the
expression is equivalent to qol$CHARLSONSCORE != -9. Also, we need to convert
our categorical variable CHARLSONSCORE into a factor.


qol <- qol[!qol$CHARLSONSCORE == -9, -c(1, 2)]
qol$CHARLSONSCORE <- as.factor(qol$CHARLSONSCORE)
str(qol)

## 'data.frame':    2328 obs. of  39 variables:
##  $ LANGUAGE           : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ AGE                : int  49 49 62 44 64 64 52 48 49 78 ...
##  $ RACE_ETHNICITY     : int  3 3 3 7 3 3 3 3 3 4 ...
##  $ SEX                : int  2 2 2 2 1 1 2 1 1 1 ...
##  $ QOL_Q_01           : int  4 4 3 6 3 3 4 2 3 5 ...
##  $ QOL_Q_02           : int  4 3 3 6 2 5 4 1 4 6 ...
##  $ QOL_Q_03           : int  4 4 4 6 3 6 4 3 4 4 ...
##  $ QOL_Q_04           : int  4 4 2 6 3 6 2 2 5 2 ...
##  $ QOL_Q_05           : int  1 5 4 6 2 6 4 3 4 3 ...
##  $ QOL_Q_06           : int  4 4 2 6 1 2 4 1 2 4 ...
##  $ QOL_Q_07           : int  1 2 5 -1 0 5 8 4 3 7 ...
##  $ QOL_Q_08           : int  6 1 3 6 6 6 3 1 2 4 ...
##  $ QOL_Q_09           : int  3 4 3 6 2 2 4 2 2 4 ...
##  $ QOL_Q_10           : int  3 1 3 6 3 6 3 2 4 3 ...
##  $ MSA_Q_01           : int  1 3 2 6 2 3 4 1 1 2 ...
##  $ MSA_Q_02           : int  1 1 2 6 1 6 4 3 2 4 ...
##  $ MSA_Q_03           : int  2 1 2 6 1 2 3 3 1 2 ...
##  $ MSA_Q_04           : int  1 3 2 6 1 2 1 4 1 5 ...
##  $ MSA_Q_05           : int  1 1 1 6 1 2 1 6 3 2 ...
##  $ MSA_Q_06           : int  1 2 2 6 1 2 1 1 2 2 ...
##  $ MSA_Q_07           : int  2 1 3 6 1 1 1 1 1 5 ...
##  $ MSA_Q_08           : int  1 1 1 6 1 1 1 1 2 1 ...
##  $ MSA_Q_09           : int  1 1 1 6 2 2 4 6 2 1 ...
##  $ MSA_Q_10           : int  1 1 1 6 1 1 1 1 1 3 ...
##  $ MSA_Q_11           : int  2 3 2 6 1 1 2 1 1 5 ...
##  $ MSA_Q_12           : int  1 1 2 6 1 1 2 6 1 3 ...
##  $ MSA_Q_13           : int  1 1 1 6 1 6 2 1 4 2 ...
##  $ MSA_Q_14           : int  1 1 1 6 1 2 1 1 3 1 ...
##  $ MSA_Q_15           : int  2 1 1 6 1 1 3 2 1 3 ...
##  $ MSA_Q_16           : int  2 3 5 6 1 2 1 2 1 2 ...
##  $ MSA_Q_17           : int  2 1 1 6 1 1 1 1 1 3 ...
##  $ PH2_Q_01           : int  3 2 1 5 1 1 3 1 2 3 ...
##  $ PH2_Q_02           : int  4 4 1 5 1 2 1 1 4 2 ...
##  $ TOS_Q_01           : int  2 2 2 4 1 1 2 2 1 1 ...
##  $ TOS_Q_02           : int  1 1 1 4 4 4 1 2 4 4 ...
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : Factor w/ 11 levels "0","1","2","3",..: 3 3 4 2 1 1 3 9 1 2 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...

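The row filter used above relies on R's operator precedence: the comparison == is evaluated before the negation !, so !x == -9 means !(x == -9). The sketch below checks this on a small made-up data frame (toy and its columns are hypothetical, not the qol data) and applies the same column-dropping and factor conversion steps.

```r
# Toy data: -9 codes a missing score, as in the CHARLSONSCORE variable
toy <- data.frame(id = 1:5, score = c(2, -9, 0, -9, 8))

keep_a <- !toy$score == -9   # parsed as !(toy$score == -9)
keep_b <- toy$score != -9    # the more explicit equivalent
same_filter <- identical(keep_a, keep_b)

# Drop the ID column, keep valid rows, and convert the target to a factor
clean <- toy[keep_a, -1, drop = FALSE]
clean$score <- as.factor(clean$score)

print(same_filter)
print(levels(clean$score))
```

Converting after filtering matters: the unused -9 code never becomes a factor level, so the classifier does not see an empty class.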


The dataset is now ready for processing. First, separate the dataset into training
and test datasets using the 75%-25% rule. Then, build an SVM model using all other
variables in the dataset as predictors. Try different misclassification costs:
rather than the default C = 1, we will explore the behavior of the model for
C = 2 and C = 3 utilizing the radial basis kernel.
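One possible way to set up this exercise is sketched below. It is not the original solution: split_75_25 is a hypothetical helper, the seed is arbitrary, and the built-in iris data stands in for qol so that the block runs without downloading anything. The ksvm call uses the kernlab arguments kernel = "rbfdot" and C for the cost, and is skipped if kernlab is not installed.

```r
# Random 75%-25% split of a data frame into training and test sets
split_75_25 <- function(df, seed = 1234) {
  set.seed(seed)
  n <- nrow(df)
  train_idx <- sample(n, size = floor(0.75 * n))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Stand-in data; for the exercise, use parts <- split_75_25(qol) and
# the formula CHARLSONSCORE ~ . instead
parts <- split_75_25(iris)

if (requireNamespace("kernlab", quietly = TRUE)) {
  # Radial basis kernel with a non-default misclassification cost C = 2
  fit_c2 <- kernlab::ksvm(Species ~ ., data = parts$train,
                          kernel = "rbfdot", C = 2)
  pred_c2 <- kernlab::predict(fit_c2, parts$test)
  print(prop.table(table(pred_c2 == parts$test$Species)))
}
```

Because the split and the kernel width estimate are random, repeated runs will only approximately reproduce any reported numbers.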

The  output  for  C  =  2  is  included  below. 


## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc  (classification)
##  parameter : cost C = 2
##
## Gaussian Radial Basis kernel function.
##  Hyperparameter : sigma =  0.0174510649312293
##
## Number of Support Vectors : 1703
##
## Objective Function Value : -1798.9778 -666.9432 -352.2265 -46.2968 -15.9236
##  -9.2176 -7.1853 -27.9366 -16.3096 -3.5681 -697.4275 -362.6579 -47.0801
##  -16.3701 -9.6556 -6.9882 -28.2074 -16.4556 -3.5121 -321.0676 -44.7405
##  -15.8416 -9.1439 -6.8161 -26.7174 -15.4833 -3.3944 -43.1026 -15.2923 -7.994
##  -6.58 -24.8459 -14.6379 -3.4484 -13.9377 -5.2876 -5.6728 -15.2542 -9.8408
##  -3.255 -4.6982 -4.8924 -9.2482 -6.5144 -2.9608 -2.7409 -6.2056 -6.0476
##  -2.0833 -6.1775 -4.919 -2.7715 -10.5691 -3.0835 -2.566
## Training error : 0.310997

##
## qol.pred2   0   1   2   3   4   5   6   7   8   9  10
##        0  126  76  24   5   1   0   1   0   3   0   1
##        1   88 170  47  19   9   2   1   0   4   3   0
##        2    1   0   0   1   0   0   0   0   0   0   0
##        3    0   0   0   0   0   0   0   0   0   0   0
##        4    0   0   0   0   0   0   0   0   0   0   0
##        5    0   0   0   0   0   0   0   0   0   0   0
##        6    0   0   0   0   0   0   0   0   0   0   0
##        7    0   0   0   0   0   0   0   0   0   0   0
##        8    0   0   0   0   0   0   0   0   0   0   0
##        9    0   0   0   0   0   0   0   0   0   0   0
##        10   0   0   0   0   0   0   0   0   0   0   0

## agreement
##        FALSE         TRUE
## 0.4914089347 0.5085910653

The  output  for  C  =  3  is  included  below. 

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc  (classification)
##  parameter : cost C = 3
##
## Gaussian Radial Basis kernel function.
##  Hyperparameter : sigma =  0.0168577510531693
##
## Number of Support Vectors : 1695
##
## Objective Function Value : -2440.0638 -915.9967 -492.6748 -63.2895 -21.0929
##  -11.9108 -10.2404 -39.1843 -21.976 -5.0624 -970.6173 -514.9584 -64.7791
##  -22.0947 -12.8987 -9.8114 -39.7908 -22.2957 -4.9403 -431.5178 -59.9296
##  -20.9408 -11.7468 -9.4269 -36.602 -20.1783 -4.6829 -56.9469 -19.7357 -9.238
##  -8.9047 -32.6121 -18.4667 -4.8007 -17.3102 -5.4133 -6.9733 -17.2097 -10.3016
##  -4.3739 -4.7816 -5.7083 -9.7236 -6.6365 -3.723 -2.7726 -6.4151 -6.4453
##  -2.1222 -8.03 -5.411 -3.3088 -11.9186 -3.996 -2.8572
## Training error : 0.266896


## qol.pred3  [cross-tabulation of predicted vs. true CHARLSONSCORE; most rows are not recoverable from the source]

## agreement
##        FALSE         TRUE
## 0.4914089347 0.5085910653


Can you reproduce (approximately) these results?


11.9 Appendix

Below is some additional R code demonstrating the various results reported in this
chapter.

#Picture 1
x <- runif(1000, -10, 10)
y <- ifelse(x >= 0, 1, 0)
plot(x, y, xlab = "Sum of input signals", ylab = "Output signal", main = "Threshold")
abline(v = 0, lty = 2)

#Picture 2
x <- runif(100000, -10, 10)
y <- 1/(1 + exp(-x))
plot(x, y, xlab = "Sum of input signals", ylab = "Output signal", main = "Sigmoid")

#Picture 3
x <- runif(100000, -10, 10)
y1 <- x
y2 <- ifelse(x <= -5, -5, ifelse(x >= 5, 5, x))
y3 <- (exp(x) - exp(-x))/(exp(x) + exp(-x))
y4 <- exp(-x^2/2)
par(mfrow = c(2, 2))
plot(x, y1, main = "Linear", xlab = "", ylab = "")
plot(x, y2, main = "Saturated Linear", xlab = "", ylab = "")
plot(x, y3, main = "Hyperbolic tangent", xlab = "", ylab = "")
plot(x, y4, main = "Gaussian", xlab = "", ylab = "")

#Picture 4
A <- c(1, 4, 3, 2, 4, 8, 6, 10, 9)
B <- c(1, 5, 3, 2, 3, 8, 8, 7, 10)
plot(A, B, xlab = "", ylab = "", pch = 16, cex = 2)
abline(v = 5, col = "red", lty = 2)
text(5.4, 9, labels = "A")
abline(12, -1, col = "red", lty = 2)
text(6, 5.4, labels = "B")

#Picture 5
plot(A, B, xlab = "", ylab = "", pch = 16, cex = 2)
segments(1, 1, 4, 5, lwd = 1, col = "red")
segments(1, 1, 4, 3, lwd = 1, col = "red")
segments(4, 3, 4, 5, lwd = 1, col = "red")
segments(6, 8, 10, 7, lwd = 1, col = "red")
segments(6, 8, 9, 10, lwd = 1, col = "red")
segments(10, 7, 9, 10, lwd = 1, col = "red")
segments(6, 8, 4, 5, lwd = 1, lty = 2)
abline(9.833, -2/3, lwd = 2)


Try to replicate these results with other data from the list of our Case-Studies.


11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines

11.10.1  Learn  and  Predict  a  Power-Function 

In Chap. 11, we learned about predicting the square-root function. It’s just one
instance of the power-function.

• Why did we observe a decrease in the accuracy of the NN prediction of the
square-root outside the interval [0, 1] (note we trained inside [0, 1])? How can you
improve on the prediction of the square-root network?

• Can you design a more generic NN that can learn and predict a power-function
for a given power (λ ∈ ℝ)?

11.10.2  Pediatric  Schizophrenia  Study 

Use the SOCR Normal and Schizophrenia pediatric neuroimaging study data to
complete the following tasks:

• Conduct some initial data visualization and exploration.

• Use derived neuroimaging biomarkers (e.g., Age, FS_IQ, TBV, GMV, WMV,
CSF, Background, L_superior_frontal_gyrus, R_superior_frontal_gyrus, ...,
brainstem) to train a NN model and predict DX (Normals = 1;
Schizophrenia = 2).

•  Try  one  hidden  layer  with  a  different  number  of  nodes. 

•  Try  multiple  hidden  layers  and  compare  the  results  to  the  single  layer.  Which 
model  is  better? 

•  Compare  the  type  I  (false-positive)  and  type  II  (false-negative)  errors  for  the 
alternative  methods. 

•  Train  separate  models  to  predict  DX  (diagnosis)  for  the  Male  and  Female  cohorts, 
respectively.  Explain  your  findings. 

•  Train an SVM, using ksvm and svm in e1071, for Age, FS_IQ, TBV, GMV,
WMV, CSF, Background to predict DX. Compare the results of linear, Gaussian,
and polynomial SVM kernels.

•  Add Sex to your models and see if this makes a difference.

•  Expand the model by training on all derived neuroimaging biomarkers and
re-train the SVM using Age, FS_IQ, TBV, GMV, WMV, CSF, Background,
L_superior_frontal_gyrus, R_superior_frontal_gyrus, ..., brainstem. Again, try
linear, Gaussian, and polynomial kernels. Compare the results.

•  Are  there  differences  between  the  alternative  kernels? 

•  For Age, FS_IQ, TBV, GMV, WMV, CSF, and Background, tune parameters for
Gaussian and polynomial kernels.

•  Generate a CV (cross-validation) plot and interpret the resulting graph.

•  Use different random seeds and repeat the experiment five times. Are the results
stable?

•  Inspecting the results above, explain why it makes sense to tune over a range
such as exp(-5:8).

•  How can we design alternative tuning strategies other than greedy search?


Chapter 12

Apriori  Association  Rules  Learning 



HTTP cookies are used to monitor web-traffic and track users surfing the Internet.
We often notice that promotions (ads) on websites tend to match our needs, reveal
our prior browsing history, or reflect our interests. That is not an accident.
Nowadays, recommendation systems rely heavily on machine learning methods that
can learn the behavior, e.g., purchasing patterns, of individual consumers. In this
chapter, we will uncover some of the mystery behind recommendation systems for
transactional records. Specifically, we will (1) discuss association rules and their
support and confidence; (2) present the Apriori algorithm for association rule
learning; and (3) cover step-by-step a set of case-studies, including a toy example,
Head and Neck Cancer Medications, and Grocery purchases.


12.1  Association  Rules 

Association rules are the result of process analytics (e.g., market analysis) that
specify patterns of relationships among items. One specific example would be:

{charcoal, lighter, chicken wings} → {barbecue sauce}

In words, charcoal, lighter and chicken wings imply barbecue sauce. Those curly
brackets indicate that we have a set. Items in a set are called elements. When an
item-set like {charcoal, lighter, chicken wings, barbecue sauce} appears in our
dataset with some regularity, we can discover the above pattern.

Association  rules  are  commonly  used  for  unsupervised  discovery  of  knowledge 
rather  than  prediction  of  outcomes.  In  biomedical  research,  association  rules  are 
widely used to:

•  Search for interesting or frequently occurring patterns of DNA.

•  Search  for  protein  sequences  in  an  analysis  of  cancer  data. 

•  Find  patterns  of  medical  claims  that  occur  in  combination  with  fraudulent  credit 
card  or  insurance  use. 


12.2  The  Apriori  Algorithm  for  Association  Rule  Learning 

Association rules are mostly applied to transactional data, like business, trade,
service or medical records. These datasets are typically very large in number of
transactions and features. The number of possible item combinations grows
exponentially with the number of items (k items yield 2^k - 1 non-empty
item-sets), which makes exhaustive pattern mining a very hard task.

With the Apriori rule, this problem becomes tractable. If we have a simple prior
(belief about the properties of frequent elements), we can efficiently reduce the
number of features or combinations that we need to look at.

The Apriori algorithm has a simple a priori belief that all subsets of a frequent
item-set must also be frequent. This is known as the Apriori property. The full set
in the last example, {charcoal, lighter, chicken wings, barbecue sauce}, can be
frequent only if all of its subsets (single elements, pairs, and triples) also occur
frequently. We can see that this algorithm is designed for finding patterns in large
datasets. If a pattern happens frequently, it is considered “interesting”.
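The Apriori property can be checked numerically on a handful of made-up baskets. The sketch below is illustrative only: the five transactions and the helper support_count are assumptions, not data from the text. It verifies that no subset of the full barbecue item-set occurs less often than the full set itself.

```r
# Hypothetical transactions (each vector is one basket)
transactions <- list(
  c("charcoal", "lighter", "chicken wings", "barbecue sauce"),
  c("charcoal", "lighter"),
  c("charcoal", "chicken wings", "barbecue sauce"),
  c("lighter", "barbecue sauce"),
  c("charcoal", "lighter", "chicken wings")
)

# Number of transactions that contain every element of the item-set
support_count <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

full_set <- c("charcoal", "lighter", "chicken wings", "barbecue sauce")
s_full <- support_count(full_set, transactions)

# All proper non-empty subsets: singletons, pairs, and triples
subsets <- unlist(lapply(1:3, function(k) combn(full_set, k, simplify = FALSE)),
                  recursive = FALSE)
s_sub <- vapply(subsets, support_count, integer(1), transactions = transactions)

# Apriori property: a subset is at least as frequent as its superset
print(all(s_sub >= s_full))
```

Read in the other direction, this is the pruning rule: once a subset falls below the support threshold, every item-set containing it can be skipped without counting.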


12.3  Measuring  Rule  Importance  by  Using  Support 
and  Confidence 

Support  and  confidence  are  the  two  criteria  to  help  us  decide  whether  a  pattern  is 
“interesting”.  By  setting  thresholds  for  these  two  criteria,  we  can  easily  limit  the 
number  of  interesting  rules  or  item-sets  reported. 

For item-sets X and Y, the support of an item-set measures how frequently it
appears in the data:

$$\text{support}(X) = \frac{\text{count}(X)}{N},$$

where N is the total number of transactions in the database and count(X) is the
number of observations (transactions) containing the item-set X. Of course, the union
of item-sets is an item-set itself. For example, if Z = X, Y, then

$$\text{support}(Z) = \text{support}(X, Y).$$

For a rule X → Y, the rule’s confidence measures the relative accuracy of the
rule:

$$\text{confidence}(X \rightarrow Y) = \frac{\text{support}(X, Y)}{\text{support}(X)}.$$

This measures the joint occurrence of X and Y over the X domain. If whenever
X appears Y tends to be present too, we will have a high confidence(X → Y). The
ranges of the support and confidence are 0 ≤ support, confidence ≤ 1. Note that in
probabilistic terms, confidence(X → Y) is equivalent to the conditional probability
P(Y|X).

{peanut butter} → {bread} would be an example of a strong rule because it has
high support as well as high confidence in grocery store transactions. Shoppers tend
to purchase bread when they get peanut butter. These items tend to appear in the
same baskets, which yields high confidence for the rule {peanut butter} → {bread}.
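The two measures can be computed by hand for a few made-up grocery baskets. This is a minimal sketch in which the baskets and the helper functions support and confidence are illustrative assumptions; it also shows that confidence is not symmetric in the two sides of a rule.

```r
# Hypothetical grocery baskets
baskets <- list(
  c("peanut butter", "bread", "jelly"),
  c("peanut butter", "bread"),
  c("bread", "milk"),
  c("peanut butter", "bread", "milk"),
  c("milk", "eggs")
)

# support(X) = count(X) / N
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# confidence(X -> Y) = support(X, Y) / support(X)
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

s_pb_bread <- support(c("peanut butter", "bread"), baskets)    # 3 of 5 baskets
c_pb_to_bread <- confidence("peanut butter", "bread", baskets) # every PB basket has bread
c_bread_to_pb <- confidence("bread", "peanut butter", baskets) # 3 of 4 bread baskets

print(c(s_pb_bread, c_pb_to_bread, c_bread_to_pb))
```

The rule {peanut butter} → {bread} has higher confidence than its reverse here because bread is also bought without peanut butter.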


12.4  Building  a  Set  of  Rules  with  the  Apriori  Principle 

To  build  a  set  of  rules,  we  need  to  go  through  two  steps: 

•  Step 1: Filter all item-sets with a minimum support threshold. This is
accomplished iteratively by increasing the size of the item-sets. In the first
iteration, we compute the support of singletons, 1-item-sets. At the next
iteration, we compute the support of pairs of items, and so on. Item-sets passing
iteration i could be considered as candidates for the next iteration, i + 1. If
{A}, {B}, {C} are all frequent, but {D} is not frequent in the first
singleton-selection round, then in the second iteration we only consider the
support of these pairs {A,B}, {A,C}, {B,C}, ignoring all pairs including D. This
substantially reduces the cardinality of the potential item-sets and ensures the
feasibility of the algorithm. At the third iteration, if {A,C} and {B,C} are
frequently occurring, but {A,B} is not, then the algorithm may terminate, as the
support of {A,B,C} is trivial (does not pass the support threshold), given that
{A,B} was not frequent enough.

•  Step 2: Using the item-sets selected in Step 1, generate new rules with confidence
larger than a predefined minimum confidence threshold. The candidate item-sets
that passed Step 1 would include all frequent item-sets. For the highly-supported
item-set {A, C}, we would compute the confidence measures for {A} → {C} as
well as {C} → {A} and compare these against the minimum confidence threshold.
The surviving rules are the ones with confidence levels exceeding that
minimum threshold.


12.5 A Toy Example

Assume  that  a  large  supermarket  tracks  sales  data  by  stock-keeping  unit  (SKU)  for 
each  item,  i.e.,  each  item,  such  as  “butter”  or  “bread”,  is  identified  by  an  SKU 
number.  The  supermarket  has  a  database  of  transactions  where  each  transaction  is  a 
set  of  SKUs  that  were  bought  together  (Table  12.1). 

Suppose the database of transactions consists of the following item-sets, each
representing a purchasing order:


require(knitr)
item_table = as.data.frame(t(c("{1,2,3,4}", "{1,2,4}", "{1,2}", "{2,3,4}",
                               "{2,3}", "{3,4}", "{2,4}")))
colnames(item_table) <- c("choice1", "choice2", "choice3", "choice4",
                          "choice5", "choice6", "choice7")
kable(item_table, caption = "Item table")

We  will  use  Apriori  to  determine  the  frequent  item-sets  of  this  database.  To  do  so, 
we  will  say  that  an  item-set  is  frequent  if  it  appears  in  at  least  3  transactions  of  the 
database,  i.e.,  the  value  3  is  the  support  threshold  (Table  12.2). 

The first step of Apriori is to count up the number of occurrences, i.e., the support,
of each member item separately. By scanning the database for the first time, we
obtain:


item_table = as.data.frame(t(c(3, 6, 4, 5)))
colnames(item_table) <- c("item1", "item2", "item3", "item4")
rownames(item_table) <- "support"
kable(item_table, caption = "Size 1 Support")

All  the  item-sets  of  size  1  have  a  support  of  at  least  3,  so  they  are  all  frequent.  The 
next  step  is  to  generate  a  list  of  all  pairs  of  frequent  items. 

For example, regarding the pair {1,2}: the item table (Table 12.1) shows items
1 and 2 appearing together in three of the item-sets; therefore, we say that the
support of the item-set {1,2} is 3 (Tables 12.3 and 12.4).


Table 12.1 Item table

choice1    choice2  choice3  choice4  choice5  choice6  choice7
{1,2,3,4}  {1,2,4}  {1,2}    {2,3,4}  {2,3}    {3,4}    {2,4}

Table 12.2 Size 1 support

         item1  item2  item3  item4
support  3      6      4      5

Table 12.3 Size 2 support

         {1,2}  {1,3}  {1,4}  {2,3}  {2,4}  {3,4}
support  3      1      2      3      4      3

Table 12.4 Size 3 support

         {2,3,4}
support  2


item_table = as.data.frame(t(c(3, 1, 2, 3, 4, 3)))
colnames(item_table) <- c("{1,2}", "{1,3}", "{1,4}", "{2,3}", "{2,4}", "{3,4}")
rownames(item_table) <- "support"
kable(item_table, caption = "Size 2 Support")

The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum
support of 3, so they are frequent. The pairs {1,3} and {1,4} are not, and any larger
set which contains {1,3} or {1,4} cannot be frequent. In this way, we can prune sets:
we will now look for frequent triples in the database, but we can already exclude all
the triples that contain one of these two pairs:

item_table = as.data.frame(t(c(2)))
colnames(item_table) <- c("{2,3,4}")
rownames(item_table) <- "support"
kable(item_table, caption = "Size 3 Support")

In the example, there are no frequent triplets: the support of the item-set {2,3,4}
is below the minimal threshold, and the other triplets were excluded because they were
supersets of pairs that were already below the threshold. We have thus determined the
frequent sets of items in the database, and illustrated how some items were not
counted because some of their subsets were already known to be below the threshold.
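The level-wise search described in this toy example can be sketched in a few lines of base R. This is an illustrative implementation under simplifying assumptions (a brute-force support count and candidate generation from all item combinations, with names such as support_count and frequent chosen here); it is not the optimized routine of the arules package.

```r
# The seven transactions of the toy example and the support threshold
transactions <- list(c(1, 2, 3, 4), c(1, 2, 4), c(1, 2), c(2, 3, 4),
                     c(2, 3), c(3, 4), c(2, 4))
min_count <- 3

# Number of transactions containing all items of the item-set
support_count <- function(itemset) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

frequent <- list()                                   # frequent item-sets by size
candidates <- as.list(sort(unique(unlist(transactions))))
k <- 1
while (length(candidates) > 0) {
  counts <- vapply(candidates, support_count, integer(1))
  keep <- candidates[counts >= min_count]            # apply support threshold
  if (length(keep) == 0) break
  frequent[[k]] <- keep
  pool <- sort(unique(unlist(keep)))                 # items still in play
  if (length(pool) < k + 1) break
  keep_keys <- vapply(keep, paste, character(1), collapse = ",")
  # Apriori pruning: keep a (k+1)-set only if all its k-subsets are frequent
  candidates <- Filter(function(s) {
    subs <- combn(s, k, simplify = FALSE)
    all(vapply(subs, paste, character(1), collapse = ",") %in% keep_keys)
  }, combn(pool, k + 1, simplify = FALSE))
  k <- k + 1
}

# Frequent pairs found at the second level
frequent_pairs <- vapply(frequent[[2]], paste, character(1), collapse = ",")
print(frequent_pairs)
```

On this data, the only triple that survives pruning is {2,3,4}, and its count of 2 fails the threshold, so the search stops with four frequent singletons and four frequent pairs, in agreement with Tables 12.2 to 12.4.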

12.6  Case  Study  1:  Head  and  Neck  Cancer  Medications 
12.6.1  Step  1:  Collecting  Data 

To demonstrate the Apriori algorithm in a real biomedical case-study, we will use
transactional healthcare data representing a subset of the Head and Neck Cancer
Medication data, which is available in our case-studies collection as
10_medication_descriptions.csv. It consists of inpatient medications
for head and neck cancer patients.

The data is in wide format, see Chap. 2, where each row represents a patient.
During the study period, each patient had records for a maximum of 5 encounters.
NA represents no medication administration records at this specific time point for
the specific patient. This dataset contains a total of 528 patients.


12.6.2  Step  2:  Exploring  and  Preparing  the  Data 

Different from our data imports in the previous chapters, transactional data need to
be ingested in R using the read.transactions() function. This function will
store data as a sparse matrix with each row representing an example and each column
representing a feature.

Let’s load the dataset and delete the irrelevant index column. With the
write.csv(R data, "path") function we can output our R data file into a local CSV
file. To avoid generating another index column in the output CSV file, we can use the
row.names = F option.

med <- read.csv("https://umich.instructure.com/files/1678540/download?download_frd=1",
                stringsAsFactors = FALSE)
med <- med[, -1]
write.csv(med, "medication.csv", row.names = F)

Now we can use read.transactions() in the arules package to read the
CSV file we just wrote.

# install.packages("arules")
library(arules)
med <- read.transactions("medication.csv", sep = ",", skip = 1, rm.duplicates = TRUE)

## distribution of transactions with duplicates:
## items
##   1   2   3
##  79 166 248

summary(med)

## transactions as itemMatrix in sparse format with
##  528 rows (elements/itemsets/transactions) and
##  88 columns (items) and a density of 0.02085486
##
## most frequent items:
##                 fentanyl injection uh hydrocodone acetaminophen 5mg 325mg
##                                   211                                 165
##                     cefazolin ivpb uh                   heparin injection
##                                   108                                 105
##  hydrocodone acetamin 75mg 500mg 15ml                             (Other)
##                                    60                                 320
##
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5
## 248 166  79  23  12
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.835   2.000   5.000
##
## includes extended item information - examples:
##                        labels
## 1                     09 nacl
## 2               09 nacl bolus
## 3 acetaminophen multiroute uh

Here we use the option rm.duplicates = T because the same medication may be
recorded more than once for a patient (e.g., at several encounters); this option
removes the duplicated items within each transaction. The option skip = 1
means we skip the heading line in the CSV file. Now we have transactional data
in which every transaction contains unique items.

The summary of transactional data contains rich information. The first block tells us that we have 528 rows and 88 different medicines in this matrix. Using the density we can calculate how many non-NA medication records are in the data. In total, we have 528 × 88 = 46,464 positions in the matrix. Thus, there are 46,464 × 0.02085 ≈ 969 medicines prescribed during the study period, which agrees with the sum of the item counts listed in the second block.

The second block lists the most frequent medicines and their frequencies in the matrix. For example, fentanyl injection uh appeared 211 times; that is, in 211/528 ≈ 40% of the (treatment) transactions. Since fentanyl is frequently used to help prevent pain after surgery or other medical procedures, we can see that many of these patients were going through some painful medical procedures.

The last block shows statistics about the size of the transactions. 248 patients had only one medicine in the study period, while 12 of them had 5 medication records, one for each time point. On average, each patient had about 1.8 different medicines.
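The arithmetic in the three paragraphs above can be checked directly from the numbers printed by summary(med). The following is a minimal base-R sketch; the variable names are ours, and all values are copied from the output above rather than recomputed from the data. With the unrounded density the record count comes out at 969, the same as the sum of the item counts.

```r
# Values copied from the summary(med) output (not recomputed from the data)
n_patients <- 528
n_items    <- 88
density    <- 0.02085486

# Number of cells in the patient-by-medication matrix and how many are filled
n_cells   <- n_patients * n_items
n_records <- round(n_cells * density)

# Relative frequency (support) of the most frequent item
fentanyl_support <- 211 / n_patients

# Mean transaction size recovered from the length distribution
sizes  <- c(1, 2, 3, 4, 5)
counts <- c(248, 166, 79, 23, 12)
mean_size <- sum(sizes * counts) / sum(counts)

print(c(cells = n_cells, records = n_records))
print(round(c(fentanyl_support = fentanyl_support, mean_size = mean_size), 3))
```

The same total appears three ways (density times cells, sum of item counts, and sum of sizes times counts), which is a quick consistency check on any transactions summary.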


Visualizing  Item  Support:  Item  Frequency  Plots 

The  summary  might  still  be  fairly  abstract;  let’s  visualize  the  data. 


inspect(med[1:5, ])

##     items
## [1] {acetaminophen uh,
##      cefazolin ivpb uh}
## [2] {docusate,
##      fioricet,
##      heparin injection,
##      ondansetron injection uh,
##      simvastatin}
## [3] {hydrocodone acetaminophen 5mg 325mg}
## [4] {fentanyl injection uh}
## [5] {cefazolin ivpb uh,
##      hydrocodone acetaminophen 5mg 325mg}

The inspect() call shows the transactional dataset. We can see that the medication records of each patient are nicely formatted as item-sets.

We can further analyze the frequent terms using itemFrequency(). The items are ordered alphabetically; the call below shows the relative frequencies (supports) of the first five items. The most frequent items can then be plotted (Fig. 12.1).

itemFrequency(med[, 1:5])

##                                 09 nacl                           09 nacl bolus
##                             0.013257576                             0.003787879
##             acetaminophen multiroute uh acetaminophen codeine 120 mg 12 mg 5 ml
##                             0.001893939                             0.001893939
##       acetaminophen codeine 300mg 30 mg
##                             0.020833333

itemFrequencyPlot(med, topN = 20)

Fig. 12.1 Rank-order plot of item frequencies


The above graph shows the top 20 medicines that are most frequently present in this dataset. Consistent with the prior summary() output, fentanyl is still the most frequent item. You can also plot the items with a threshold for support. Instead of topN = 20, just use the option support = 0.1, which will give you all the items that have a support greater than or equal to 0.1.
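To make the idea of a support threshold concrete, here is a minimal base-R sketch that computes relative item frequencies and filters them by a minimum support. The toy transactions and the 0.5 threshold are hypothetical and chosen only so the result is easy to verify by hand; this is not how arules computes frequencies internally.

```r
# Toy transactions (hypothetical data, not the medication dataset)
toy <- list(
  c("fentanyl", "heparin"),
  c("fentanyl"),
  c("cefazolin", "fentanyl"),
  c("heparin"),
  c("cefazolin")
)

# Relative item frequency = share of transactions containing the item
items   <- sort(unique(unlist(toy)))
support <- sapply(items, function(it) mean(sapply(toy, function(tr) it %in% tr)))

# Keep only items whose support reaches the threshold, most frequent first
min_support <- 0.5
frequent <- sort(support[support >= min_support], decreasing = TRUE)

print(support)
print(frequent)
```

Raising the threshold shortens the list of plotted items, which is the trade-off the support option controls.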


Visualizing  Transaction  Data:  Plotting  the  Sparse  Matrix 

The sparse matrix shows which medications were prescribed for each patient (Fig. 12.2).

image(med[1:5, ])

Fig. 12.2 A characteristic plot of the prescribed medications (columns) for the first 5 patients (rows)

The image in Fig. 12.2 has 5 rows (we only requested the first 5 patients) and 88 columns (88 different medicines). Although the picture may be a little hard to interpret, it gives a sense of which medicines were prescribed for each patient in the study.

Let’s see an expanded graph including 100 randomly chosen patients (Fig. 12.3).

subset_int <- sample(nrow(med), 100, replace = F)
image(med[subset_int, ])

Fig. 12.3 A characteristic plot of the prescribed medications (columns) for 100 random patients (rows)

It shows us clearly that some medications are more popular than others. Now, let’s fit the Apriori model.

12.6.3 Step 3: Training a Model on the Data


With the data in place, we can build the association rules using the apriori() function.

myrules <- apriori(data = mydata, parameter = list(support = 0.1, confidence = 0.8, minlen = 1))

• data: a sparse matrix created by read.transactions().

• support: minimum threshold for support.

• confidence: minimum threshold for confidence.

• minlen: minimum required rule items (in our case, medications).
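The two thresholds listed above can be illustrated by computing support and confidence by hand. This is a minimal base-R sketch on hypothetical toy transactions; the helper names support() and confidence() are ours and are not arules functions. It uses the standard definitions: the support of an itemset is the share of transactions containing all of its items, and confidence(X → Y) is support(X and Y) divided by support(X).

```r
# Toy transactions (hypothetical), one character vector per patient
toy <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("B", "C"),
  c("A", "B")
)

# Share of transactions that contain every item in `itemset`
support <- function(itemset, transactions) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

# confidence(X -> Y) = support(X and Y) / support(X)
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

supp_AB <- support(c("A", "B"), toy)   # 3 of 5 transactions contain A and B
conf_AB <- confidence("A", "B", toy)   # 3 of the 4 transactions with A also have B

# A rule is kept only if both thresholds are met
keep <- supp_AB >= 0.1 && conf_AB >= 0.8
print(c(support = supp_AB, confidence = conf_AB))
print(keep)
```

Here the rule A → B clears the support threshold but not the confidence threshold, so it would be dropped under support = 0.1, confidence = 0.8.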

Choosing the thresholds can be hard. If they are too high, you get no rules, or only rules that everyone already knows. If they are too low, you get too many rules to review. Let’s see what we get under the default setting support = 0.1, confidence = 0.8:


apriori(med)

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 52
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

Not surprisingly, we have 0 rules. The default thresholds are too high for these data. In practice, we might need some time to fine-tune these thresholds, which may require certain familiarity with the underlying process or clinical phenomenon.

In this case study, we set support = 0.01 and confidence = 0.25. This requires rules that appear in at least 1% of the head and neck cancer patients in the study. Also, the rules have to have at least 25% confidence. Moreover, minlen = 2 is a very helpful option because it removes all rules that have fewer than two items.

The results suggest we have a new rules object consisting of 29 rules.

med_rule <- apriori(med, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 5
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

med_rule

## set of 29 rules


12.6.4  Step  4:  Evaluating  Model  Performance 

First,  we  can  obtain  the  overall  summary  of  this  set  of  rules. 

summary(med_rule)

## set of 29 rules
##
## rule length distribution (lhs + rhs):sizes
##  2  3  4
## 13 12  4
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00    2.00    3.00    2.69    3.00    4.00
##
## summary of quality measures:
##     support           confidence          lift
##  Min.   :0.01136   Min.   :0.2500   Min.   :0.7583
##  1st Qu.:0.01705   1st Qu.:0.3390   1st Qu.:1.3333
##  Median :0.01894   Median :0.4444   Median :1.7481
##  Mean   :0.03448   Mean   :0.4491   Mean   :1.8636
##  3rd Qu.:0.03788   3rd Qu.:0.5000   3rd Qu.:2.2564
##  Max.   :0.11174   Max.   :0.8000   Max.   :3.9111
##
## mining info:
##  data ntransactions support confidence
##   med           528    0.01       0.25

Fig. 12.4 Confidence-Support scatterplot of 29 rules


We have 13 rules that contain two items, 12 rules that contain 3 items, and the remaining 4 rules contain 4 items.

The lift column shows how much more likely one medicine is to be prescribed to a patient given that another medicine is prescribed. It is obtained by the following formula:

$$\text{lift}(X \rightarrow Y) = \frac{\text{confidence}(X \rightarrow Y)}{\text{support}(Y)}.$$

Note that lift(X → Y) is the same as lift(Y → X). The range of lift is [0, ∞); a lift above 1 indicates that X and Y occur together more often than would be expected if they were independent, so higher lift is better. We don’t need to worry about support since we already set a minimum threshold that the support of every rule will exceed.
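As a quick check of the lift formula, the sketch below reproduces the lift of the rule {acetaminophen uh} => {cefazolin ivpb uh}, which is the first rule listed by inspect() in this step, from its reported confidence and from the cefazolin item count (108 of 528) in the earlier summary. It is a minimal base-R illustration with values copied from the printed outputs; the helper lift() is ours and is not an arules function. The second part verifies the symmetry numerically by writing lift as support(X and Y) / (support(X) × support(Y)).

```r
# Numbers copied from the medication outputs in this section
n <- 528
conf_acet_cef  <- 0.4615385        # confidence({acetaminophen uh} => {cefazolin ivpb uh})
supp_cefazolin <- 108 / n          # cefazolin ivpb uh appears in 108 transactions

# lift = confidence(X -> Y) / support(Y)
lift_acet_cef <- conf_acet_cef / supp_cefazolin
print(round(lift_acet_cef, 4))

# Symmetric form: lift = supp(X and Y) / (supp(X) * supp(Y))
lift <- function(supp_xy, supp_x, supp_y) supp_xy / (supp_x * supp_y)
supp_xy <- 0.01136364              # reported support of the rule
supp_x  <- supp_xy / conf_acet_cef # lhs support implied by the confidence

lift_xy <- lift(supp_xy, supp_x, supp_cefazolin)
lift_yx <- lift(supp_xy, supp_cefazolin, supp_x)
print(c(lift_xy, lift_yx))
```

Swapping the roles of X and Y only swaps the two factors in the denominator, which is why the two values agree.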

Using the arulesViz package we can visualize the confidence and support scatter plots for all the rules (Fig. 12.4).

# install.packages("arulesViz")
library(arulesViz)
plot(sort(med_rule))

Again, we can utilize the inspect() function to see exactly what these rules are.

inspect(med_rule[1:3])

##     lhs                               rhs                 support    confidence lift
## [1] {acetaminophen uh}             => {cefazolin ivpb uh} 0.01136364 0.4615385  2.256410
## [2] {ampicillin sulbactam ivpb uh} => {heparin injection} 0.01893939 0.3448276  1.733990
## [3] {ondansetron injection uh}     => {heparin injection} 0.01704545 0.2727273  1.371429

Here, lhs and rhs refer to the “left hand side” and “right hand side” of the rule, respectively. The lhs is the given condition and the rhs is the predicted result. Using the first row as an example: if a head-and-neck patient has been prescribed acetaminophen (pain reliever and fever reducer), it is likely that the patient is also prescribed cefazolin (an antibiotic that fights bacterial infections); bacterial infections are associated with fevers and some cancers.


12.6.5  Step  5:  Improving  Model  Performance 

Sorting  the  Set  of  Association  Rules 

Sorting  the  resulting  association  rules  corresponding  to  high  lift  values  will  help  us 
select  the  most  useful  rules. 

inspect(sort(med_rule, by = "lift")[1:3])

##     lhs                                      rhs                 support    confidence lift
## [1] {fentanyl injection uh,
##      heparin injection,
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.01515152 0.8000000  3.911111
## [2] {cefazolin ivpb uh,
##      fentanyl injection uh,
##      hydrocodone acetaminophen 5mg 325mg} => {heparin injection} 0.01515152 0.6153846  3.094505
## [3] {heparin injection,
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.03787879 0.6250000  3.055556
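The effect of sorting by lift can also be reproduced on an ordinary data frame. The sketch below is a minimal base-R illustration that stores the three rules shown above (values copied from the output, lhs omitted for brevity) and orders them with order(); it only mirrors what sort(..., by = "lift") does for rules objects and is not the arules implementation.

```r
# Three rules copied from the output above, stored in a plain data frame
rules_df <- data.frame(
  rhs        = c("heparin injection", "cefazolin ivpb uh", "cefazolin ivpb uh"),
  support    = c(0.01515152, 0.03787879, 0.01515152),
  confidence = c(0.6153846, 0.6250000, 0.8000000),
  lift       = c(3.094505, 3.055556, 3.911111)
)

# Order rows by decreasing lift
by_lift <- rules_df[order(rules_df$lift, decreasing = TRUE), ]
print(by_lift)

# Ordering by confidence gives a different ranking for the last two rules
by_conf <- rules_df[order(rules_df$confidence, decreasing = TRUE), ]
print(by_conf)
```

Because lift divides confidence by the support of the rhs, a rule with slightly lower confidence can still rank higher when its rhs is a rarer item.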

These rules may need to be interpreted by clinicians and experts in the specific context of the study. For instance, the first row, {fentanyl, heparin, hydrocodone acetaminophen} implies {cefazolin}. Fentanyl and hydrocodone acetaminophen are both pain relievers that may be prescribed after surgery. Heparin is usually used before surgery to reduce the risk of blood clots. This rule may suggest that patients who have undergone surgical treatments are likely to need cefazolin to prevent post-surgical bacterial infection.

Taking Subsets of Association Rules

If  we  are  more  interested  in  investigating  associations  that  are  linked  to  a  specific 
medicine,  we  can  narrow  the  rules  down  by  making  subsets.  Let  us  try  investigating 
rules  related  to  fentanyl,  since  it  appears  to  be  the  most  frequently  prescribed  medicine. 


fi_rules <- subset(med_rule, items %in% "fentanyl injection uh")
inspect(fi_rules)

##      lhs                                      rhs                                   support    confidence lift
## [1]  {ondansetron injection uh}            => {fentanyl injection uh}               0.01893939 0.3030303  0.7582938
## [2]  {fentanyl injection uh,
##       ondansetron injection uh}            => {hydrocodone acetaminophen 5mg 325mg} 0.01136364 0.6000000  1.9200000
## [3]  {hydrocodone acetaminophen 5mg 325mg,
##       ondansetron injection uh}            => {fentanyl injection uh}               0.01136364 0.3750000  0.9383886
## [4]  {cefazolin ivpb uh,
##       fentanyl injection uh}               => {heparin injection}                   0.01893939 0.5000000  2.5142857
## [5]  {fentanyl injection uh,
##       heparin injection}                   => {cefazolin ivpb uh}                   0.01893939 0.4761905  2.3280423
## [6]  {cefazolin ivpb uh,
##       fentanyl injection uh}               => {hydrocodone acetaminophen 5mg 325mg} 0.02462121 0.6500000  2.0800000
## [7]  {fentanyl injection uh,
##       hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh}                   0.02462121 0.3250000  1.5888889
## [8]  {fentanyl injection uh,
##       heparin injection}                   => {hydrocodone acetaminophen 5mg 325mg} 0.01893939 0.4761905  1.5238095
## [9]  {heparin injection,
##       hydrocodone acetaminophen 5mg 325mg} => {fentanyl injection uh}               0.01893939 0.3125000  0.7819905
## [10] {fentanyl injection uh,
##       hydrocodone acetaminophen 5mg 325mg} => {heparin injection}                   0.01893939 0.2500000  1.2571429
## [11] {cefazolin ivpb uh,
##       fentanyl injection uh,
##       heparin injection}                   => {hydrocodone acetaminophen 5mg 325mg} 0.01515152 0.8000000  2.5600000
## [12] {cefazolin ivpb uh,
##       heparin injection,
##       hydrocodone acetaminophen 5mg 325mg} => {fentanyl injection uh}               0.01515152 0.4000000  1.0009479
## [13] {cefazolin ivpb uh,
##       fentanyl injection uh,
##       hydrocodone acetaminophen 5mg 325mg} => {heparin injection}                   0.01515152 0.6153846  3.0945055
## [14] {fentanyl injection uh,
##       heparin injection,
##       hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh}                   0.01515152 0.8000000  3.9111111
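The subsetting call above relies on membership matching. As a brief aside, the following minimal base-R sketch shows how %in% behaves on ordinary character vectors; the three item lists are hypothetical stand-ins for rules, and the way arules applies %in% to rules objects inside subset() is not reproduced here.

```r
# Items of three hypothetical rules (lhs and rhs pooled per rule)
rule_items <- list(
  c("fentanyl injection uh", "heparin injection"),
  c("acetaminophen uh", "cefazolin ivpb uh"),
  c("ondansetron injection uh", "fentanyl injection uh")
)

# %in% tests whether the left-hand value occurs in the right-hand vector
target <- "fentanyl injection uh"
in_first <- target %in% rule_items[[1]]

# Keep the rules in which the target item occurs anywhere
has_target <- sapply(rule_items, function(items) target %in% items)
print(which(has_target))

# Matching is exact: a misspelled label matches nothing
misspelled_hit <- any(sapply(rule_items,
                             function(items) "fentonyl injection uh" %in% items))
print(misspelled_hit)
```

Because matching is exact, the item label passed to the subset must be spelled exactly as it appears in the item labels of the transactions.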


In R scripting, the notation %in% signifies “belongs to.” There are 14 rules related to this item. Let’s plot them (Fig. 12.5).

plot(sort(fi_rules, by = "lift"), method = "grouped", control = list(type = "items"),
     main = "Grouped Matrix for the 14 Fentanyl-associated Rules")

Fig. 12.5 Bubble chart of the grouped matrix for 14 rules (size: support, color: lift)


## Available control parameters (with default values):
## main           = Grouped Matrix for 14 Rules
## k              = 20
## rhs_max        = 10
## lhs_items      = 2
## aggr.fun       = function (x, na.rm = FALSE) UseMethod("median")
## col            = c("#EE0000FF", "#EE0303FF", "#EE0606FF", ..., "#EEECECFF", "#EEEEEEFF")  [color ramp abbreviated]
## reverse        = TRUE
## xlab           = NULL
## ylab           = NULL
## legend         = Size: support
##                  Color: lift
## spacing        = -1
## panel.function = function (row, size, shading, spacing) {
##                    size[size == 0] <- NA
##                    shading[is.na(shading)] <- 1
##                    grid.circle(x = c(1:length(size)), y = row, r = size/2 * (1 - spacing),
##                      default.units = "native", gp = gpar(fill = shading, col = shading, alpha = 0.9)) }
## gp_main        = list(cex = 1.2, fontface = "bold", font = 2)
## gp_labels      = list(cex = 0.8)
## gp_labs        = list(cex = 1.2, fontface = "bold", font = 2)
## gp_lines       = list(col = "gray", lty = 3)
## newpage        = TRUE
## interactive    = FALSE
## max.shading    = NA
## verbose        = FALSE

Saving  Association  Rules  to  a  File  or  Data  Frame 

We can save these rules into a CSV file using write(). It is similar to the function write.csv() that we mentioned at the beginning of this case study.

write(med_rule, file = "medrule.csv", sep = ",", row.names = F)

Sometimes  it  is  more  convenient  to  convert  the  rules  into  a  data  frame. 


med_df <- as(med_rule, "data.frame")
str(med_df)

## 'data.frame':    29 obs. of  4 variables:
##  $ rules     : Factor w/ 29 levels "{acetaminophen uh} => {cefazolin ivpb uh}",..: 1 2 28 27 29 13 12 14 10 23 ...
##  $ support   : num  0.0114 0.0189 0.017 0.0189 0.0303 ...
##  $ confidence: num  0.462 0.345 0.273 0.303 0.485 ...
##  $ lift      : num  2.256 1.734 1.371 0.758 1.552 ...

As we can see, the rules are converted into a factor vector, stored alongside numeric support, confidence, and lift columns.
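Once the rules are in a data frame, ordinary string and data-frame tools apply. The following is a minimal base-R sketch on a small stand-in for med_df (three rules with values copied from the inspect() output earlier in this case study); the column names lhs and rhs are our own additions. It converts the factor to character, splits each rule at the arrow, and filters by the right-hand side.

```r
# A small stand-in for med_df (values copied from the first three rules)
rules_df <- data.frame(
  rules = factor(c("{acetaminophen uh} => {cefazolin ivpb uh}",
                   "{ampicillin sulbactam ivpb uh} => {heparin injection}",
                   "{ondansetron injection uh} => {heparin injection}")),
  lift  = c(2.256410, 1.733990, 1.371429)
)

# Convert the factor to character before any string processing
rules_df$rules <- as.character(rules_df$rules)

# Split each rule at the arrow into its left- and right-hand side
parts <- strsplit(rules_df$rules, " => ", fixed = TRUE)
rules_df$lhs <- sapply(parts, `[`, 1)
rules_df$rhs <- sapply(parts, `[`, 2)

# Rules predicting heparin, ordered by decreasing lift
heparin <- rules_df[rules_df$rhs == "{heparin injection}", ]
print(heparin[order(-heparin$lift), c("lhs", "rhs", "lift")])
```

Keeping the lhs and rhs in separate character columns makes it straightforward to filter or group the exported rules outside of arules.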


12.7  Practice  Problems:  Groceries 


In this practice problem, we will investigate the associations of frequently purchased groceries using the Groceries dataset that ships with the arules package. First, let’s load the data.

data("Groceries")
summary(Groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

We will try to find out the top 5 frequent grocery items and plot them (Fig. 12.6).

Fig. 12.6 Top-5 grocery items according to their frequencies

Then, try to use support = 0.006, confidence = 0.25, minlen = 2 to set up the grocery association rules. Sort the top 3 rules with highest lift.

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 463 rules

##     lhs                   rhs                  support     confidence lift
## [1] {herbs}            => {root vegetables}    0.007015760 0.4312500  3.956477
## [2] {berries}          => {whipped/sour cream} 0.009049314 0.2721713  3.796886
## [3] {tropical fruit,
##      other vegetables,
##      whole milk}       => {root vegetables}    0.007015760 0.4107143  3.768074

The number of rules (463) appears excessive. We can try stricter parameters. In practice, the rules that remain are fewer and easier to interpret if you set a higher confidence. Here we set confidence = 0.6.

groceryrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.6, minlen = 2))

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

groceryrules

## set of 8 rules

inspect(sort(groceryrules, by = "lift")[1:3])

##     lhs                             rhs          support     confidence lift
## [1] {butter,whipped/sour cream} => {whole milk} 0.006710727 0.6600000  2.583008
## [2] {butter,yogurt}             => {whole milk} 0.009354347 0.6388889  2.500387
## [3] {root vegetables,butter}    => {whole milk} 0.008235892 0.6377953  2.496107

Fig. 12.7 Live demo: association rule mining

We observe mainly rules between dairy products. It makes sense that customers pick up milk when they walk down the dairy products aisle. Experiment further with various parameter settings and try to interpret the results in the context of this grocery case-study (Fig. 12.7).
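The lift values of the three whole-milk rules can be recovered from numbers already printed in this section. The sketch below is a minimal base-R check using the whole milk count (2513 of 9835 transactions) from the Groceries summary and the reported confidences. The use of floor() for the minimum support count is an assumption on our part; it simply agrees with the count of 59 shown in the log.

```r
# Counts and measures copied from the Groceries outputs above
n_transactions  <- 9835
whole_milk_freq <- 2513
supp_milk <- whole_milk_freq / n_transactions

# Confidence of the three top rules, all with {whole milk} on the rhs
conf <- c(0.6600000, 0.6388889, 0.6377953)

# lift = confidence / support(rhs)
lift <- conf / supp_milk
print(round(lift, 4))

# Minimum absolute support count implied by support = 0.006
# (floor() is assumed here because it matches the log output)
min_count <- floor(0.006 * n_transactions)
print(min_count)
```

All three rules share the same rhs, so their ranking by lift is the same as their ranking by confidence; this is not true in general when the rhs items differ.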

Mining  association  rules  Demo  https://rdrr.io/cran/arules/ 


# copy-paste this R code into the live online demo:
# https://rdrr.io/snippets/
# press RUN, and examine the results.
# The Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult) includes 48842 sparse transactions
# (rows) and 115 items (columns).

library(arules)
data("Adult")

rules <- apriori(Adult,
    parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
summary(rules)

inspect(sort(rules, by = "lift")[1:3])

# Results: mining info:
#   data ntransactions support confidence
#  Adult         48842     0.5        0.9
#
#     lhs                                                            rhs                               support   confidence lift     count
# [1] {sex=Male, native-country=United-States}                    => {race=White}                   0.5415421 0.9051090  1.058554 26450
# [2] {sex=Male, capital-loss=None, native-country=United-States} => {race=White}                   0.5113632 0.9032585  1.056390 24976
# [3] {race=White}                                                => {native-country=United-States} 0.7881127 0.9217231  1.027076 38493


12.8  Summary 

• The Apriori algorithm for association rule learning is mainly suitable for large transactional data. For some small datasets, it might not be very helpful.

• It is useful for discovering associations, mostly in early phases of an exploratory study.

• Some rules can arise by chance and may need further verification.

• See also Chap. 20 (Text Mining and NLP).

Try to replicate these results with other data from the list of our Case-Studies.


12.9  Assignments:  12.  Apriori  Association  Rules  Learning 

Use  the  SOCR  Jobs  Data  to  practice  learning  via  Apriori  Association  Rules 

•  Load  the  Jobs  Data.  Use  this  guide  to  load  HTML  data. 

•  Focus  on  the  Description  feature.  Replace  all  underscore  characters  with 
spaces. 

•  Review  Chap.  8,  use  tm  package  to  process  text  data  to  plain  text.  (Hint:  need  to 
apply  stemDocument  as  well,  we  will  discuss  more  details  in  Chap.  20.) 

•  Generate  a  “transaction”  matrix  by  considering  each  job  as  one  record  and 
description  words  as  “transaction”  items.  (Hint:  You  need  to  fill  missing  values 
since  records  do  not  have  the  same  length  of  description.) 

•  Save  the  data  using  write.csv()  and  then  use  read.transactions()  in  arules  package 
to  read  the  CSV  data  file.  Visualize  the  item  support  using  item  frequency  plots. 
What  terms  appear  as  more  popular? 

• Fit a model: myrules <- apriori(data = jobs, parameter = list(support = 0.02, confidence = 0.6, minlen = 2)). Try out several rule thresholds trading off gain and accuracy.

• Evaluate the rules you obtained with lift and visualize their metrics.

• Mine medical related rules (e.g., rules that include “treatment”, “patient”, “care”, “diagnos”; notice that these are word stems).

• Sort the set of association rules for all and medical related subsets.

• Save these rules into a CSV file.

Chapter 13

k-Means  Clustering 




As  we  learned  in  Chaps.  7,  8,  and  9,  classification  could  help  us  make  predictions  on 
new  observations.  However,  classification  requires  (human  supervised)  predefined 
label  classes.  What  if  we  are  in  the  early  phases  of  a  study  and/or  don’t  have  the 
required  resources  to  manually  define,  derive  or  generate  these  class  labels? 

Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics or for predicting specific outcomes.

In this chapter, we will present (1) clustering as a machine learning task, (2) the silhouette plots for clustering evaluation, (3) the k-Means clustering algorithm and how to tune it, (4) examples of several interesting case-studies, including Divorce and Consequences on Young Adults, Pediatric Trauma, and Youth Development, (5) hierarchical clustering, and (6) Gaussian mixture modeling.


13.1  Clustering  as  a  Machine  Learning  Task 

As we mentioned before, clustering represents classification of unlabeled cases. Scatter plots depict a simple illustration of the clustering process. Assume we don’t know much about the ingredients of frankfurter hot dogs and we have the following graph (Fig. 13.1).

# See Chapter 12 code for complete details
# install.packages("rvest")
library(rvest)
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_012708_ID_Data_HotDogs")
html_nodes(wiki_url, "#content")
hotdog <- html_table(html_nodes(wiki_url, "table")[[1]])
plot(hotdog$Calories, hotdog$Sodium, main = "Hotdogs", xlab = "Calories", ylab = "Sodium")

Fig. 13.1 Hotdogs dataset - scatterplot of calories and sodium content blocked by type of meat


In terms of calories and sodium, these hot dogs are clearly separated into three different clusters. Cluster 1 has hot dogs of low calories and medium sodium content; Cluster 2 has both calorie and sodium at medium levels; Cluster 3 has both sodium and calories at high levels. We can make a bold guess about the meats used in these three clusters of hot dogs. For Cluster 1, it could be mostly chicken meat since it has low calories. The second cluster might be beef, and the third one is likely to be pork, because beef hot dogs have considerably fewer calories and less salt than pork hot dogs. However, this is just guessing. Some hot dogs have a mixture of two or three types of meat. The real situation is somewhat similar to what we guessed but with some random noise, especially in Cluster 2.
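The visual grouping described above can be mimicked numerically. The sketch below is a minimal illustration that simulates three well-separated groups as a stand-in for the hot dog data (all values are hypothetical, not the SOCR measurements) and recovers them with kmeans() from base R's stats package. The algorithm itself and the choice of k are assumed here rather than explained.

```r
# Synthetic stand-in for the hot dog data (hypothetical values, three groups)
set.seed(1234)
calories <- c(rnorm(20, mean = 110, sd = 3),
              rnorm(20, mean = 150, sd = 3),
              rnorm(20, mean = 185, sd = 3))
sodium   <- c(rnorm(20, mean = 450, sd = 10),
              rnorm(20, mean = 400, sd = 10),
              rnorm(20, mean = 550, sd = 10))
toy <- data.frame(calories, sodium)

# Standardize so that neither variable dominates the Euclidean distance
toy_z <- scale(toy)

# k-means with k = 3 and several random restarts
fit <- kmeans(toy_z, centers = 3, nstart = 25)

# Cross-tabulate the recovered clusters against the simulated groups
truth <- rep(c("low", "medium", "high"), each = 20)
confusion <- table(truth, fit$cluster)
print(confusion)
```

With real data the groups overlap, so the cross-tabulation against known meat types would typically show some mixing rather than the clean separation obtained here.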

The following two plots show the primary type of meat used for each hot dog labeled by name (top) and color-coded (bottom) (Figs. 13.2 and 13.3).

Fig. 13.2 Scatterplot of calories and sodium content with meat type labels

Fig. 13.3 An alternative scatterplot of the hotdogs calories and sodium (legend: Type = Beef, Meat, Poultry)


13.2  Silhouette  Plots 

Silhouette plots are useful for interpretation and validation of consistency of all clustering algorithms. The silhouette value, $s \in [-1, 1]$, measures the similarity (cohesion) of a data point to its cluster relative to other clusters (separation). Silhouette plots rely on a distance metric, e.g., the Euclidean distance, Manhattan distance, Minkowski distance, etc.

• A high silhouette value suggests that the data point matches its own cluster well.

• A clustering algorithm performs well when most silhouette values are high.

• A low value indicates that the point is poorly matched to its own cluster relative to the neighboring cluster.

• Poor clustering may imply that the algorithm configuration has too many or too few clusters.

Suppose a clustering method groups all data points (objects), $\{X_i\}_i$, into $k$ clusters and define:

• $d_i$ as the average dissimilarity of $X_i$ with all other data points within its cluster. $d_i$ captures the quality of the assignment of $X_i$ to its current class label. Smaller or larger $d_i$ values suggest better or worse overall assignment for $X_i$ to its cluster, respectively. The average dissimilarity of $X_i$ to a cluster $C$ is the average distance between $X_i$ and all points in the cluster of points labeled $C$.

• $l_i$ as the lowest average dissimilarity of $X_i$ to any other cluster that $X_i$ is not a member of. The cluster corresponding to $l_i$, the lowest average dissimilarity, is called the $X_i$ neighboring cluster, as it is the next best fit cluster for $X_i$.

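Then, the silhouette of $X_i$ is defined by:

$$-1 \le s_i = \frac{l_i - d_i}{\max\{l_i, d_i\}} =
\begin{cases}
1 - \frac{d_i}{l_i}, & \text{if } d_i < l_i \\
0, & \text{if } d_i = l_i \\
\frac{l_i}{d_i} - 1, & \text{if } d_i > l_i
\end{cases}.$$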


Note that:

• $-1 \le s_i \le 1$,

• $s_i \to 1$ when $d_i \ll l_i$, i.e., the dissimilarity of $X_i$ to its cluster, $C$, is much lower relative to its dissimilarity to other clusters, indicating a good (cluster assignment) match. Thus, high silhouette values imply the data is appropriately clustered.

• Conversely, $s_i \to -1$ when $l_i \ll d_i$, i.e., $d_i$ is large, implying a poor match of $X_i$ with its current cluster $C$, relative to neighboring clusters. $X_i$ may be more appropriately clustered in its neighboring cluster.

• $s_i \approx 0$ means that $X_i$ may lie on the border between two natural clusters.
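To make the definition concrete, the following minimal sketch computes silhouette values directly from the average within-cluster dissimilarity and the lowest average dissimilarity to another cluster. The helper name silhouette_by_hand and the toy data are hypothetical, introduced only for illustration; the sketch assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets a silhouette of 0.

```r
# Illustrative helper (not part of any package): silhouette values by hand
silhouette_by_hand <- function(x, labels) {
  D <- as.matrix(dist(x))                 # pairwise Euclidean distances
  n <- nrow(D)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    own[i] <- FALSE                       # exclude the point itself
    if (!any(own)) { s[i] <- 0; next }    # singleton cluster: s_i = 0 (convention)
    d_i <- mean(D[i, own])                # average dissimilarity to own cluster
    others <- setdiff(unique(labels), labels[i])
    # lowest average dissimilarity to any cluster the point does not belong to
    l_i <- min(sapply(others, function(cl) mean(D[i, labels == cl])))
    s[i] <- (l_i - d_i) / max(l_i, d_i)
  }
  s
}

# Toy data: two tight, well-separated groups on a line
toy <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol = 1)
s_good <- silhouette_by_hand(toy, c(1, 1, 1, 2, 2, 2))

# Deliberately assign the third point to the wrong cluster
s_bad <- silhouette_by_hand(toy, c(1, 1, 2, 2, 2, 2))
print(round(s_good, 3))
print(round(s_bad, 3))
```

With the correct labels every value is close to 1, whereas the deliberately mislabeled third point receives a strongly negative value, matching the interpretation given in the notes above.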


13.3  The  k-Means  Clustering  Algorithm 

The  k-means  algorithm  is  one  of  the  most  commonly  used  algorithms  for  clustering. 


13.3.1  Using  Distance  to  Assign  and  Update  Clusters 


This algorithm is similar to k-nearest neighbors (KNN), presented in Chap. 7, in that it relies on distances between observations. In clustering, however, we don't have a priori pre-determined labels, and the algorithm tries to deduce intrinsic groupings in the data.

Similar to KNN, k-means most often uses the Euclidean distance ($l_2$ norm); however, the Manhattan distance ($l_1$ norm), or the more general Minkowski distance,

$$\left( \sum_{i=1}^{n} |p_i - q_i|^c \right)^{1/c},$$

may also be used. For $c = 2$, the Minkowski distance represents the classical Euclidean distance:

$$dist(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$
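As a small numerical illustration of these distances, the sketch below evaluates the Minkowski distance for two exponents and cross-checks the Euclidean case against the base R dist() function. The function name minkowski, the argument name c_exp, and the two example vectors are hypothetical choices made for this illustration.

```r
# Minkowski distance between two numeric vectors (illustrative helper)
minkowski <- function(p, q, c_exp) sum(abs(p - q)^c_exp)^(1 / c_exp)

p <- c(1, 2, 3)
q <- c(4, 6, 3)

manhattan <- minkowski(p, q, 1)   # exponent 1: sum of absolute differences
euclid    <- minkowski(p, q, 2)   # exponent 2: classical Euclidean distance

# Cross-check with the built-in dist() function
d_builtin <- as.numeric(dist(rbind(p, q), method = "minkowski", p = 2))
print(c(manhattan = manhattan, euclidean = euclid, dist_builtin = d_builtin))
```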

How can we separate clusters using this formula? The k-means protocol is as follows:

• Initiation: First, we define k points as cluster centers. Often these points are k random points from the dataset. For example, if k = 3, we choose three random points in the dataset as cluster centers.

• Assignment: Second, we assign each observation to the cluster whose center is nearest according to the chosen distance; for the current centers, this is the assignment that yields the least within-cluster sum of squares. Now the data is separated into k initial clusters. Mathematically, this is equivalent to a Voronoi tessellation of the space of the observations generated by the cluster centers.

• Update: Third, we update the center of each cluster to the mean (centroid) of the observations currently assigned to it. This updating phase is the essence of the k-means algorithm. The assignment and update steps are then repeated until the assignments no longer change.

Although  there  is  no  guarantee  that  the  k-means  algorithm  converges  to  a  global 
optimum,  in  practice,  the  algorithm  tends  to  converge,  i.e.,  the  assignments  no  longer 
change,  to  a  local  minimum  as  there  are  only  a  finite  number  of  such  Voronoi 
partitionings. 
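The three steps of the protocol can be sketched in a few lines of base R. This is only an illustrative implementation under simple assumptions (Euclidean distance, random data points as initial centers, and a hypothetical function name simple_kmeans); in practice the built-in kmeans() function should be preferred.

```r
# Minimal sketch of the initiation / assignment / update loop
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # Initiation: k random points
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment: squared Euclidean distance of every point to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")  # index of nearest center
    if (all(new_assignment == assignment)) break            # assignments are stable
    assignment <- new_assignment
    # Update: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}

# Two well-separated synthetic groups of 20 points each
set.seed(1)
blobs <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(blobs, k = 2)
print(table(fit$cluster))
```

The loop stops as soon as an assignment step leaves every label unchanged, which is the convergence criterion described above.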




Fig. 13.4 Elbow plot of the within-group homogeneity against the number of groups parameter (k)

13.3.2 Choosing the Appropriate Number of Clusters

We don't want our number of clusters to be either too large or too small. If it is too large, the groups are too specific to be meaningful. On the other hand, too few groups might be too broadly general to be useful. As we mentioned in Chap. 7, $k = \sqrt{n/2}$ is a good place to start. However, it might generate a large number of groups. Also, the elbow method may be used to determine the relationship between k and the homogeneity of the observations within each cluster. When we graph within-group homogeneity against k, we can find an "elbow point" that suggests a minimum k corresponding to relatively large within-group homogeneity (Fig. 13.4).

This  graph  shows  that  homogeneity  barely  increases  above  the  “elbow  point”. 
There  are  various  ways  to  measure  homogeneity  within  a  cluster.  For  detailed 
explanations  please  read  On  clustering  validation  techniques,  Journal  of  Intelligent 
Information  Systems  Vol.  17,  pp.  107-145,  by  M.  Halkidi,  Y.  Batistakis,  and 
M.  Vazirgiannis  (2001). 
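One common way to draw an elbow plot, sketched below on synthetic data, uses the total within-cluster sum of squares reported by kmeans() as the (inverse) homogeneity measure: lower values mean more homogeneous clusters, so the curve decreases and flattens after the elbow. The three simulated groups and all variable names are assumptions made for this illustration.

```r
# Synthetic data with three well-separated groups of 30 points each
set.seed(2)
demo <- rbind(matrix(rnorm(60, 0, 0.4), ncol = 2),
              matrix(rnorm(60, 4, 0.4), ncol = 2),
              matrix(rnorm(60, 8, 0.4), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(demo, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For these data the curve drops sharply up to k = 3 and changes little afterwards, which is the pattern the elbow method looks for.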


13.4 Case Study 1: Divorce and Consequences on Young Adults

13.4.1  Step  1:  Collecting  Data 

The  dataset  we  will  be  using  is  the  Divorce  and  Consequences  on  Young  Adults 
dataset.  This  is  a  longitudinal  study  focused  on  examining  the  consequences  of 
recent  parental  divorce  for  young  adults  (initially  ages  18-23)  whose  parents  had 
divorced  within  15  months  of  the  study’s  first  wave  (1990-91).  The  sample 
consisted  of  257  White  respondents  with  newly  divorced  parents.  Here  we  have  a 
subset  of  this  dataset  with  47  respondents  in  our  case-studies  folder, 
CaseStudy01_Divorce_YoungAdults_Data.csv. 
Variables

• DIVYEAR: Year in which parents were divorced. Dichotomous variable with 1989 and 1990.

• Child affective relations:

- Momint: Mother intimacy. Interval level data with four possible responses (1-extremely close, 2-quite close, 3-fairly close, 4-not close at all).

- Dadint: Father intimacy. Interval level data with four possible responses (1-extremely close, 2-quite close, 3-fairly close, 4-not close at all).

- Live with mom: Polytomous variable with three categories (1-mother only, 2-father only, 3-both parents).

• momclose: measure of how close the child is to the mother (1-extremely close, 2-quite close, 3-fairly close, 4-not close at all).

• Depression: Interval level data regarding feelings of depression in the past 4 weeks. Possible responses are 1-often, 2-sometimes, 3-hardly ever, 4-never.

• Gethitched: Polytomous variable with four possible categories indicating respondent's plan for marriage (1-marry fairly soon, 2-marry sometime, 3-never marry, 8-don't know).


13.4.2  Step  2:  Exploring  and  Preparing  the  Data 

Let’s  load  the  dataset  and  pull  out  a  summary  of  all  variables. 

divorce <- read.csv("https://umich.instructure.com/files/399118/download?download_frd=1")
summary(divorce)

##     DIVYEAR          momint          dadint         momclose
##  Min.   :89.00   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:89.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000
##  Median :90.00   Median :1.000   Median :2.000   Median :2.000
##  Mean   :89.68   Mean   :1.809   Mean   :2.489   Mean   :1.809
##  3rd Qu.:90.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.000
##  Max.   :90.00   Max.   :4.000   Max.   :4.000   Max.   :4.000
##    depression     livewithmom      gethitched
##  Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000
##  Median :3.000   Median :1.000   Median :2.000
##  Mean   :2.851   Mean   :1.489   Mean   :2.213
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:2.000
##  Max.   :4.000   Max.   :9.000   Max.   :8.000


According to the summary, DIVYEAR is actually a binary variable (either 89 or 90). We can recode (binarize) DIVYEAR using the ifelse() function (mentioned in Chap. 8). The following line of code generates a new indicator variable for divorce year = 1990.

divorce$DIVYEAR <- ifelse(divorce$DIVYEAR == 89, 0, 1)

We also need another preprocessing step to deal with livewithmom, which has a missing value coded as livewithmom = 9. We can impute it using the momint and dadint variables for the specific participant.

table(divorce$livewithmom)

##
##  1  2  9
## 31 15  1

divorce[divorce$livewithmom == 9, ]

##    DIVYEAR momint dadint momclose depression livewithmom gethitched
## 45       1      3      1        3          3           9          2

For instance, respondents that feel much closer to their dads may be assigned divorce$livewithmom = 2, suggesting they most likely live with their fathers. Of course, alternative imputation strategies are also possible.

divorce[45, 6] <- 2
divorce[45, ]

##    DIVYEAR momint dadint momclose depression livewithmom gethitched
## 45       1      3      1        3          3           2          2
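The manual fix of row 45 can be generalized into a rule. The sketch below applies one possible rule to a small made-up data frame rather than the real dataset: a missing code of 9 is replaced by 2 (father only) when the respondent reports being closer to the father, and by 1 (mother only) otherwise. Both the toy values and the tie-breaking choice are assumptions made for this illustration.

```r
# Hypothetical toy records; 9 marks a missing livewithmom value
toy <- data.frame(momint      = c(1, 3, 2),
                  dadint      = c(3, 1, 2),
                  livewithmom = c(1, 9, 9))

missing <- toy$livewithmom == 9

# Lower intimacy codes mean "closer": closer to dad -> 2, otherwise (incl. ties) -> 1
toy$livewithmom[missing] <- ifelse(toy$dadint[missing] < toy$momint[missing], 2, 1)
print(toy)
```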


13.4.3  Step  3:  Training  a  Model  on  the  Data 

We are only using R base functionality, so there is no need to install any additional packages now; however, library(stats) may still be necessary. Then, the function kmeans() will provide the k-means clustering of the data.

myclusters <- kmeans(mydata, k)

• mydata: dataset in a matrix form.

• k: number of clusters we want to create.

• output:

- myclusters$cluster: vector indicating the cluster number for every observation.

- myclusters$centers: a matrix showing the mean feature values for every center.

- myclusters$size: a vector showing how many observations are assigned to each cluster.
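To see these three output components in a self-contained setting, the short sketch below clusters three standardized columns of the built-in mtcars dataset. The choice of dataset, columns, seed, and nstart value is arbitrary and only meant to illustrate the structure of the returned object.

```r
# Cluster three standardized mtcars features into k = 3 groups
set.seed(42)
cars_z <- scale(mtcars[, c("mpg", "hp", "wt")])
fit <- kmeans(cars_z, centers = 3, nstart = 10)

head(fit$cluster)   # cluster label for each observation
fit$centers         # one row of mean feature values per cluster
fit$size            # number of observations in each cluster
```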
Before we perform clustering, we need to standardize the features to avoid biasing the clustering toward features that use large-scale values. Note that distance calculations are sensitive to measuring units. We use lapply() to apply scale() to every column of the dataset, and as.data.frame() to convert the resulting list back into a data frame.

di_z <- as.data.frame(lapply(divorce, scale))
str(di_z)

## 'data.frame':    47 obs. of  7 variables:
##  $ DIVYEAR    : num  0.677 0.677 -1.445 0.677 -1.445 ...
##  $ momint     : num  1.258 1.258 -0.854 1.258 -0.854 ...
##  $ dadint     : num  -0.514 -0.514 0.536 1.586 0.536 ...
##  $ momclose   : num  0.225 1.401 -0.951 1.401 0.225 ...
##  $ depression : num  0.164 -0.937 1.265 0.164 -2.038 ...
##  $ livewithmom: num  -0.711 1.377 -0.711 -0.711 -0.711 ...
##  $ gethitched : num  0.846 -0.229 -0.229 0.846 -0.229 ...

The resulting dataset, di_z, is standardized so all features are unitless, with mean 0 and standard deviation 1.
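As a quick sanity check of what standardization does, the following sketch scales a tiny made-up data frame and confirms that each column ends up with mean 0 and standard deviation 1. The data frame raw and its values are hypothetical.

```r
# Two features recorded on very different scales
raw <- data.frame(year = c(89, 90, 90, 89, 90), score = c(1, 4, 2, 3, 1))

# scale() centers each column and divides by its standard deviation
z <- as.data.frame(lapply(raw, function(col) as.numeric(scale(col))))

round(colMeans(z), 10)   # column means after scaling
sapply(z, sd)            # column standard deviations after scaling
```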

Next, we need to think about selecting a proper k. We have a relatively small dataset with 47 observations. Obviously we cannot have a k as large as 10. The rule of thumb suggests $k = \sqrt{47/2} \approx 4.8$. This would also be relatively large, because we would have fewer than 10 observations per cluster on average. It is very likely that some clusters would contain only one observation. A better choice may be 3. Let's see if this will work.

library(stats)
set.seed(321)
diz_clussters <- kmeans(di_z, 3)


13.4.4  Step  4:  Evaluating  Model  Performance 

Let’s  look  at  the  clusters  created  by  the  k-means  model. 

diz_clussters$size 
##  [1]  12  24  11 

At first glance, it seems that 3 worked well for the number of clusters. We don't have any cluster that contains only a small number of observations; the three clusters include 12, 24, and 11 respondents.

Silhouette plots provide an appropriate strategy to assess the quality of the clustering. Silhouette values are between −1 and 1. In our case, two data points correspond to negative silhouette values, suggesting these cases may be "mis-clustered" or perhaps are ambiguous, as their silhouette values are close to 0. We can observe that the average silhouette is reasonable, about 0.2 (Fig. 13.5).
Fig. 13.5 Silhouette plot for the 3 classes (n = 47; cluster sizes 12, 24, and 11 with average silhouette widths 0.16, 0.28, and 0.08; overall average silhouette width 0.2)


require(cluster)
dis = dist(di_z)
sil = silhouette(diz_clussters$cluster, dis)
summary(sil)

## Silhouette of 47 units in 3 clusters from silhouette.default(x = diz_clussters$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##         12         24         11
## 0.16444649 0.27684356 0.07921684
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.08466  0.11760  0.20080  0.20190  0.30450  0.39820

plot(sil)

The  next  step  would  be  to  interpret  the  clusters  in  the  context  of  this  social  study. 


diz_clussters$centers

##      DIVYEAR     momint      dadint   momclose depression livewithmom
## 1  0.5004720  1.1698438 -0.07631029  1.2049200 -0.1112567   0.1591755
## 2 -0.2953914 -0.5016290  0.36107795 -0.5096937  0.1180883  -0.7107373
## 3  0.0985208 -0.1817299 -0.70455885 -0.2023993 -0.1362761   1.3770536
##   gethitched
## 1 -0.1390230
## 2 -0.1390230
## 3  0.4549845

This result shows:

• Cluster 1: divyear = mostly 90, momint = very close, dadint = not close, livewithmom = mostly mother, depression = not often, gethitched (marry) = will likely not get married. Cluster 1 represents mostly adolescents that are closer to mom than dad. These young adults do not often feel depressed and they may avoid getting married. These young adults tend not to be too emotional and do not value family.

• Cluster 2: divyear = mostly 89, momint = not close, dadint = very close, livewithmom = father, depression = mild, marry = do not know/not inclined. Cluster 2 includes children that mostly live with dad and only feel close to dad. These people did not feel severely depressed and are not inclined to marry. These young adults may prefer freedom and tend to be more naive.

• Cluster 3: divyear = mix of 89 and 90, momint = not close, dadint = not at all, livewithmom = mother, depression = sometimes, marry = tend to get married. Cluster 3 contains children that did not feel close to either dad or mom. They sometimes felt depressed and are willing to build their own family. These young adults seem to be more independent.

Fig. 13.6 Barplot illustrating the features discriminating between the three cohorts in the divorce consequences on young adults dataset

We  can  see  that  these  three  different  clusters  do  contain  three  alternative  types  of 
young  adults.  Bar  plots  provide  an  alternative  strategy  to  visualize  the  difference 
between  clusters  (Fig.  13.6). 


par(mfrow=c(1, 1), mar=c(4, 4, 4, 2))
myColors <- c("darkblue", "red", "green", "brown", "pink", "purple", "yellow")
barplot(t(diz_clussters$centers), beside = TRUE, xlab="cluster",
        ylab="value", col = myColors)
legend("topleft", ncol=2, legend = c("DIVYEAR", "momint", "dadint",
       "momclose", "depression", "livewithmom", "gethitched"), fill = myColors)

For each of the three clusters, the bars in the plot above represent the following order of features: DIVYEAR, momint, dadint, momclose, depression, livewithmom, gethitched.


13.4.5  Step  5:  Usage  of  Cluster  Information 


Clustering  results  could  be  utilized  as  new  information  augmenting  the  original 
dataset.  For  instance,  we  can  add  a  cluster  label  in  our  divorce  dataset: 


divorce$clusters <- diz_clussters$cluster
divorce[1:5, ]


##   DIVYEAR momint dadint momclose depression livewithmom gethitched
## 1
## 2
## 3
## 4
## 5
##
## 1
## 2
## 3
## 4
## 5
We can also examine the relationship between living with mom and feeling close to mom by displaying a scatter plot of these two variables. If we suspect that young adults' personality might affect this relationship, then we could consider the potential personality (cluster type) in the plot. The cluster labels associated with each participant are printed in different positions relative to each pair of observations, (livewithmom, momint) (Fig. 13.7).

Fig. 13.7 Drill down for one feature (live-with-mom) between the three
require(ggplot2)

ggplot(divorce, aes(livewithmom, momint), main="Scatterplot Live with mom vs feel close to mom") +
  geom_point(aes(colour = factor(clusters), shape = factor(clusters), stroke = 8), alpha = 1) +
  theme_bw(base_size = 25) +
  # label the members of each cluster, offsetting the labels in a different direction per cluster
  geom_text(aes(label = ifelse(clusters %in% 1, as.character(clusters), ''),
                hjust = 2, vjust = 2, colour = factor(clusters))) +
  geom_text(aes(label = ifelse(clusters %in% 2, as.character(clusters), ''),
                hjust = -2, vjust = 2, colour = factor(clusters))) +
  geom_text(aes(label = ifelse(clusters %in% 3, as.character(clusters), ''),
                hjust = 2, vjust = -1, colour = factor(clusters))) +
  guides(colour = guide_legend(override.aes = list(size = 8))) +
  theme(legend.position = "top")

We used the ggplot() function in the ggplot2 package to label points with cluster types. ggplot(divorce, aes(livewithmom, momint)) + geom_point() gives us the scatterplot, and the three geom_text() functions help us label the points with the corresponding cluster identifiers.

This picture shows that living with mom does not necessarily mean young adults will feel close to mom. The "emotional" (Cluster 1) young adults felt close to their mom whether they lived with her or not. "Naive" (Cluster 2) young adults feel closer to mom if they live with mom. However, they tend to be estranged from their mother. "Independent" (Cluster 3) young adults are opposite to Cluster 1. They felt closer to mom if they don't live with her.


13.5  Model  Improvement 

Let's still use the divorce data to illustrate a model improvement using k-means++. Appropriate initialization of the k-means algorithm is of paramount importance. The k-means++ extension provides a practical strategy to obtain a good initialization for k-means clustering, which we implement here as a user-defined kpp_init method.

# install.packages("matrixStats")
require(matrixStats)

kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  set.seed(123)
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMins(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}

clust_kpp = kmeans(di_z, kpp_init(di_z, 3), iter.max=100, algorithm='Lloyd')
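The same seeding idea can be written more compactly in base R by letting sample.int() draw each new center with probability proportional to the squared distance to the nearest center chosen so far. The sketch below is an alternative illustration, not the method used above; the helper name kpp_seed and the synthetic data are hypothetical, and the code assumes the data contain at least k distinct points.

```r
# Compact k-means++ style seeding using weighted sampling (illustrative)
kpp_seed <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- integer(k)
  idx[1] <- sample.int(n, 1)                       # first center: uniform at random
  d2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)        # squared distance to nearest center
  for (j in seq_len(k)[-1]) {
    idx[j] <- sample.int(n, 1, prob = d2)          # weight each point by its distance^2
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  x[idx, , drop = FALSE]
}

# Two synthetic groups of 20 points each
set.seed(7)
pts <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 6, 0.3), ncol = 2))
init <- kpp_seed(pts, 2)
fit <- kmeans(pts, centers = init, iter.max = 100, algorithm = "Lloyd")
print(fit$size)
```

Because a point that is already a center has zero weight, it cannot be selected twice, so the returned initial centers are distinct rows of the data.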

We  can  observe  some  differences. 


clust_kpp$centers

##      DIVYEAR     momint     dadint   momclose depression livewithmom
## 1  0.3741445  1.2578161 -0.6636602  0.5610651 -0.1505730  -0.4124815
## 2 -0.2659149 -0.5798266  0.3805174 -0.2538624  0.1639572  -0.5560862
## 3  0.3508225  0.5269697 -0.4329499  0.2251408 -0.2594488   1.3770536
##   gethitched
## 1  0.9990071
## 2 -0.1489684
## 3 -0.2285310

This improvement is not substantial; the new overall average silhouette value remains 0.2 for k-means++. This matches the value of 0.2 reported for the earlier k-means clustering, although the three groups generated by each method are quite distinct. In addition, the number of "mis-clustered" instances remains 2, although their silhouette values are lower than before, and the overall Cluster 1 silhouette average value is low (0.006) (Fig. 13.8).


Fig. 13.8 Silhouette plot for k-means++ classification (n = 47; cluster sizes 7, 27, and 13 with average silhouette widths 0.006, 0.25, and 0.19; overall average silhouette width 0.2)

Fig. 13.9 Evolution of the average silhouette value with respect to the number of clusters



sil2 = silhouette(clust_kpp$cluster, dis)
summary(sil2)

## Silhouette of 47 units in 3 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##          7         27         13
## 0.00644352 0.24933847 0.19476785
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.12750  0.08781  0.22950  0.19810  0.29050  0.38120

plot(sil2)


13.5.1  Tuning  the  Parameter  k 

Similar  to  what  we  performed  for  KNN  and  SVM,  we  can  tune  the  k-means 
parameters,  including  centers  initialization  and  k  (Fig.  13.9). 
n_rows <- 21
mat = matrix(0, nrow = n_rows)
for (i in 2:n_rows){
  set.seed(321)
  clust_kpp = kmeans(di_z, kpp_init(di_z, i), iter.max=100, algorithm='Lloyd')
  sil = silhouette(clust_kpp$cluster, dis)
  # column 3 of the silhouette object holds the individual silhouette widths
  mat[i] = mean(as.matrix(sil)[, 3])
}
colnames(mat) <- c("Avg_Silhouette_Value")
mat

##       Avg_Silhouette_Value
##  [1,]            0.0000000
##  [2,]            0.1948335
##  [3,]            0.1980686
##  [4,]            0.1789654
##  [5,]            0.1716270
##  [6,]            0.1546357
##  [7,]            0.1622488
##  [8,]            0.1767659
##  [9,]            0.1928883
## [10,]            0.2026559
## [11,]            0.2006313
## [12,]            0.1586044
## [13,]            0.1735035
## [14,]            0.1707446
## [15,]            0.1626367
## [16,]            0.1609723
## [17,]            0.1785733
## [18,]            0.1839546
## [19,]            0.1660019
## [20,]            0.1573574
## [21,]            0.1561791

ggplot(data.frame(k=2:n_rows, sil=mat[2:n_rows]), aes(x=k, y=sil)) +
  geom_line() +
  scale_x_continuous(breaks = 2:n_rows)

This suggests that $k \approx 3$ may be an appropriate number of clusters to use in this case. Next, let's set the maximal iteration of the algorithm and rerun the model with optimal k = 2, k = 3, or k = 10. Below, we just demonstrate the results for k = 3. There are still 2 mis-clustered observations, which is not a significant improvement on the prior model according to the average silhouette measure (Fig. 13.10).

k <- 3
set.seed(31)
clust_kpp = kmeans(di_z, kpp_init(di_z, k), iter.max=200, algorithm="MacQueen")
sil3 = silhouette(clust_kpp$cluster, dis)
summary(sil3)

## Silhouette of 47 units in 3 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##         10         22         15
## 0.02096194 0.30414984 0.15474729
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.1365  0.1032  0.1971  0.1962  0.3122  0.4113

plot(sil3)
Fig. 13.10 Silhouette plot for the optimal k = 3 and kpp_init initialization (n = 47; cluster sizes 10, 22, and 15 with average silhouette widths 0.02, 0.30, and 0.15; overall average silhouette width 0.2)

Note  that  we  now  see  3  cases  of  group  1  that  have  negative  silhouette  values 
(previously  we  had  only  2),  albeit  the  overall  average  silhouette  remains  0.2. 

13.6  Case  Study  2:  Pediatric  Trauma 

Let’s  go  through  another  example  demonstrating  the  k-means  clustering  method 
using  a  larger  dataset. 


13.6.1  Step  1:  Collecting  Data 

The  dataset  we  will  interrogate  now  includes  Services  Utilization  by  Trauma- 
Exposed  Children  in  the  US  data,  which  is  located  in  our  case-studies  folder.  This 
case  study  examines  associations  between  post-traumatic  psychopathology  and 
service  utilization  by  trauma-exposed  children. 

Variables:

• id: Case identification number.

• sex: Female or male, dichotomous variable (1 = female, 0 = male).

• age: Age of child at time of seeking treatment services. Interval-level variable, score range = 0-18.

• race: Race of child seeking treatment services. Polytomous variable with 4 categories (1 = black, 2 = white, 3 = hispanic, 4 = other).

• cmt: The child was exposed to child maltreatment trauma. Dichotomous variable (1 = yes, 0 = no).

• traumatype: Type of trauma exposure the child is seeking treatment for. Polytomous variable with 5 categories ("sexabuse" = sexual abuse, "physabuse" = physical abuse, "neglect" = neglect, "psychabuse" = psychological or emotional abuse, "dvexp" = exposure to domestic violence or intimate partner violence).

• ptsd: The child has current post-traumatic stress disorder. Dichotomous variable (1 = yes, 0 = no).

• dissoc: The child currently has a dissociative disorder (PTSD dissociative subtype, DESNOS, DDNOS). Interval-level variable, score range = 0-11.

• service: Number of services the child has utilized in the past 6 months, including primary care, emergency room, outpatient therapy, outpatient psychiatrist, inpatient admission, case management, in-home counseling, group home, foster care, treatment foster care, therapeutic recreation or mentor, department of social services, residential treatment center, school counselor, special classes or school, detention center or jail, probation officer. Interval-level variable, score range = 0-19.

• Note: These data (Case_04_ChildTrauma._Data.csv) are tab-delimited.


13.6.2  Step  2:  Exploring  and  Preparing  the  Data 


First,  we  need  to  load  the  dataset  into  R  and  report  its  summary  and  dimensions. 


trauma <- read.csv("https://umich.instructure.com/files/399129/download?download_frd=1", sep = " ")
summary(trauma); dim(trauma)

##        id              sex             age              ses
##  Min.   :   1.0   Min.   :0.000   Min.   : 2.000   Min.   :0.00
##  1st Qu.: 250.8   1st Qu.:0.000   1st Qu.: 7.000   1st Qu.:0.00
##  Median : 500.5   Median :1.000   Median : 9.000   Median :0.00
##  Mean   : 500.5   Mean   :0.506   Mean   : 8.982   Mean   :0.18
##  3rd Qu.: 750.2   3rd Qu.:1.000   3rd Qu.:11.000   3rd Qu.:0.00
##  Max.   :1000.0   Max.   :1.000   Max.   :25.000   Max.   :1.00
##        race          traumatype       ptsd          dissoc
##  black   :200   dvexp     :250   Min.   :0.00   Min.   :0.000
##  hispanic:100   neglect   :350   1st Qu.:0.00   1st Qu.:0.000
##  other   :100   physabuse :100   Median :0.00   Median :1.000
##  white   :600   psychabuse:200   Mean   :0.29   Mean   :0.598
##                 sexabuse  :100   3rd Qu.:1.00   3rd Qu.:1.000
##                                  Max.   :1.00   Max.   :1.000
##     service
##  Min.   : 0.000
##  1st Qu.: 8.000
##  Median :10.000
##  Mean   : 9.926
##  3rd Qu.:12.000
##  Max.   :20.000

## [1] 1000
In the summary we see two factors, race and traumatype. traumatype codes the real classes we are interested in. If the clusters created by the model are quite similar to the trauma types, our model may have a quite reasonable interpretation. Let's also create a dummy variable for each racial category.

trauma$black <- ifelse(trauma$race == "black", 1, 0)
trauma$hispanic <- ifelse(trauma$race == "hispanic", 1, 0)
trauma$other <- ifelse(trauma$race == "other", 1, 0)
trauma$white <- ifelse(trauma$race == "white", 1, 0)
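The four ifelse() calls above can also be written as a single call to model.matrix(), which builds one indicator column per factor level. The sketch below shows this on a small made-up factor rather than the trauma data; the data frame demo and its values are hypothetical.

```r
# Hypothetical race factor with the same four categories
demo <- data.frame(race = factor(c("black", "white", "hispanic", "other", "white")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ race - 1, data = demo)
colnames(dummies) <- sub("^race", "", colnames(dummies))  # strip the "race" prefix
print(dummies)
```

Each row of the resulting matrix contains exactly one 1, marking the category of that observation.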

Then, we will remove the class variable, traumatype, from the dataset to avoid biasing the clustering algorithm. We also drop the id column and the original race factor, which is now encoded by the dummy variables. Thus, we are simulating a real biomedical case-study where we do not necessarily have the actual class information available, i.e., classes are latent features.

trauma_notype <- trauma[, -c(1, 5, 6)]


13.6.3  Step  3:  Training  a  Model  on  the  Data 

Similar  to  case-study  1,  let’s  standardize  the  dataset  and  fit  a  k-means  model. 


tr_z <- as.data.frame(lapply(trauma_notype, scale))
str(tr_z)

## 'data.frame':    1000 obs. of  10 variables:
##  $ sex     : num  0.988 0.988 -1.012 -1.012 0.988 ...
##  $ age     : num  -0.997 1.677 -0.997 0.674 -0.662 ...
##  $ ses     : num  -0.468 -0.468 -0.468 -0.468 -0.468 ...
##  $ ptsd    : num  1.564 -0.639 -0.639 -0.639 1.564 ...
##  $ dissoc  : num  0.819 -1.219 0.819 0.819 0.819 ...
##  $ service : num  2.314 0.678 -0.303 0.351 1.66 ...
##  $ black   : num  2 2 2 2 2 ...
##  $ hispanic: num  -0.333 -0.333 -0.333 -0.333 -0.333 ...
##  $ other   : num  -0.333 -0.333 -0.333 -0.333 -0.333 ...
##  $ white   : num  -1.22 -1.22 -1.22 -1.22 -1.22 ...

set.seed(1234)
trauma_clusters <- kmeans(tr_z, 6)


Here we use k = 6 in the hope that 5 of these clusters may match the 5 specific trauma types. In this case study, we have 1000 observations and k = 6 may be a reasonable option.
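
To see why standardization precedes k-means, here is a small sketch on synthetic data (the `toy` data frame and its two columns are made up for illustration; `nstart = 10` is an extra setting not used in the text). It reuses the same lapply(..., scale) idiom and then fits kmeans().

```r
set.seed(1234)
# Synthetic data: two groups measured on very different scales
toy <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 5)),
  y = c(rnorm(50, mean = 0, sd = 100), rnorm(50, mean = 500, sd = 100))
)

# z-score each column so that no variable dominates the Euclidean distance
toy_z <- as.data.frame(lapply(toy, scale))

# Several random starts reduce the dependence on the initial centers
fit <- kmeans(toy_z, centers = 2, nstart = 10)
fit$size
fit$centers
```

Without scaling, the distance would be driven almost entirely by `y`, whose spread is about 100 times larger than that of `x`.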


Fig. 13.11 Key predictors discriminating between the 6 cohorts in the trauma study


13.6.4  Step  4:  Evaluating  Model  Performance 


To  assess  the  clustering  model  results,  we  can  examine  the  resulting  clusters 
(Fig.  13.11). 


trauma_clusters$centers

##            sex          age          ses         ptsd      dissoc
## 1 -0.001999144  0.061154336 -0.091055799 -0.077094361  0.02446247
## 2 -1.011566709 -0.006361734 -0.000275301  0.002214351  0.81949287
## 3  0.026286613 -0.043755817  0.029890657  0.064206246 -1.21904661
## 4  0.067970886  0.046116384 -0.078047828  0.044053921 -0.07746450
## 5  0.047979449 -0.104263129  0.156095655  0.022026960 -0.09784989
## 6  0.987576985  0.028799955  0.019511957 -0.038046568  0.81949287
##        service      black   hispanic      other      white
## 1  0.001308569  1.9989997 -0.3331666 -0.3331666 -1.2241323
## 2  0.126332303 -0.4997499 -0.3331666 -0.3331666  0.8160882
## 3 -0.030083167 -0.4997499 -0.3331666 -0.3331666  0.8160882
## 4  0.128894052 -0.4997499 -0.3331666  2.9984996 -1.2241323
## 5 -0.103376956 -0.4997499  2.9984996 -0.3331666 -1.2241323
## 6 -0.111481162 -0.4997499 -0.3331666 -0.3331666  0.8160882

myColors <- c("darkblue", "red", "green", "brown", "pink", "purple",
              "lightblue", "orange", "grey", "yellow")
barplot(t(trauma_clusters$centers), beside = TRUE, xlab="cluster",
        ylab="value", col = myColors)
legend("topleft", ncol=4, legend = c("sex", "age", "ses", "ptsd", "dissoc",
       "service", "black", "hispanic", "other", "white"), fill = myColors)
On this barplot, the bars in each cluster represent sex, age, ses, ptsd, dissoc, service, black, hispanic, other, and white, respectively. Each cluster has some unique features; the race indicators, sex, and dissoc show the largest differences between cluster centers.

Next,  we  can  compare  the  k-means  computed  cluster  labels  to  the  original  labels. 
Let’s  evaluate  the  similarities  between  the  automated  cluster  labels  and  their  real 
class  counterparts  using  a  confusion  matrix  table,  where  rows  represent  the  k-means 
clusters,  columns  show  the  actual  labels,  and  the  cell  values  include  the  frequencies 
of  the  corresponding  pairings. 


trauma$clusters <- trauma_clusters$cluster
table(trauma$clusters, trauma$traumatype)

##
##     dvexp neglect physabuse psychabuse sexabuse
##   1     0       0       100          0      100
##   2    10     118         0         61        0
##   3    23     133         0         79        0
##   4   100       0         0          0        0
##   5   100       0         0          0        0
##   6    17      99         0         60        0

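We can see that all of the children in Clusters 4 and 5 belong to dvexp (exposure to domestic violence or intimate partner violence). If we use the mode of each cluster as the class for that group of children, then, reading the table above, we correctly classify 100 cases in Cluster 1 (which is split evenly between physabuse and sexabuse, so either label recovers 100 cases), 118 + 133 + 99 = 350 neglect cases in Clusters 2, 3, and 6, and 100 + 100 = 200 dvexp cases in Clusters 4 and 5. That is 650 of the 1,000 cases identified with the correct class. The model has a problem distinguishing between neglect and psychabuse, and between physabuse and sexabuse, because these pairs always fall in the same clusters, but it isolates most of the dvexp cases.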
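
The "mode of each cluster" rule can be computed directly from the confusion table. The sketch below re-enters the counts printed above by hand (the object names `tab`, `majority_count`, `majority_label`, and `purity` are illustrative, not from the text) and reports the share of cases that fall in the modal class of their cluster.

```r
# Counts copied from the cluster-by-traumatype table above (rows = k-means clusters)
tab <- matrix(c(  0,   0, 100,  0, 100,
                 10, 118,   0, 61,   0,
                 23, 133,   0, 79,   0,
                100,   0,   0,  0,   0,
                100,   0,   0,  0,   0,
                 17,  99,   0, 60,   0),
              nrow = 6, byrow = TRUE,
              dimnames = list(cluster = 1:6,
                              traumatype = c("dvexp", "neglect", "physabuse",
                                             "psychabuse", "sexabuse")))

# Size of the modal class in each cluster
majority_count <- apply(tab, 1, max)

# Label of the modal class; which.max() resolves ties to the first maximum
majority_label <- colnames(tab)[apply(tab, 1, which.max)]

# Proportion of observations that match the modal class of their cluster
purity <- sum(majority_count) / sum(tab)
purity
```

This quantity is often called cluster purity; it can only increase as k grows, so it should not be used on its own to choose the number of clusters.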

Let's review the silhouette value summary. All individual silhouette widths are positive, so no samples appear mis-clustered, although the mean width of about 0.22 indicates only weak separation between the clusters.


dis_tra = dist(tr_z)
sil_tra = silhouette(trauma_clusters$cluster, dis_tra)
summary(sil_tra)

## Silhouette of 1000 units in 6 clusters from silhouette.default(x = trauma_clusters$cluster, dist = dis_tra) :
##  Cluster sizes and average silhouette widths:
##       200       189       235       100       100       176
## 0.2595725 0.2185706 0.1039559 0.3223076 0.3199830 0.2423110
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## 0.008893 0.139100 0.234400 0.224500 0.303300 0.388200

# plot(sil_tra)
# report the overall mean silhouette value
mean(sil_tra[, "sil_width"])

## [1] 0.2245298

# The sil object colnames are ("cluster", "neighbor", "sil_width")
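
For readers who want to see what the silhouette width measures, the following base-R sketch recomputes it from a distance matrix and a label vector. It is an illustrative re-implementation under the usual definition s(i) = (b - a) / max(a, b), not the code of cluster::silhouette(); the function name `silhouette_widths` and the toy points are assumptions made for this example.

```r
silhouette_widths <- function(d, cl) {
  d  <- as.matrix(d)
  ks <- sort(unique(cl))
  s  <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cl == cl[i]
    if (sum(own) == 1) { s[i] <- 0; next }      # singleton clusters get width 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)        # mean distance to own cluster (self-distance is 0)
    b <- min(sapply(setdiff(ks, cl[i]),         # mean distance to the nearest other cluster
                    function(k) mean(d[i, cl == k])))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

# Toy example: two well-separated 2D groups with known labels
set.seed(1)
pts <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 6), ncol = 2))
cl  <- rep(1:2, each = 20)
s   <- silhouette_widths(dist(pts), cl)
mean(s)
```

Widths near 1 indicate points much closer to their own cluster than to any other, widths near 0 indicate points on a boundary, and negative widths suggest a point may belong to a neighboring cluster.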
Fig. 13.12 Evolution of the average silhouette value with respect to the number of clusters


Next, let's try to tune k with k-means++ and see if k = 6 appears to be optimal (Fig. 13.12).


mat = matrix(0, nrow = 11)
for (i in 2:11){
  set.seed(321)
  clust_kpp = kmeans(tr_z, kpp_init(tr_z, i), iter.max=100, algorithm='Lloyd')
  sil = silhouette(clust_kpp$cluster, dis_tra)
  mat[i] = mean(as.matrix(sil)[, 3])
}
mat

##            [,1]
##  [1,] 0.0000000
##  [2,] 0.2433222
##  [3,] 0.1675486
##  [4,] 0.1997315
##  [5,] 0.2116534
##  [6,] 0.2400086
##  [7,] 0.2251367
##  [8,] 0.2199859
##  [9,] 0.2249569
## [10,] 0.2347122
## [11,] 0.2304451

ggplot(data.frame(k=2:11, sil=mat[2:11]), aes(x=k, y=sil)) + geom_line() +
  scale_x_continuous(breaks = 2:11)
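
The tuning loop stores one mean silhouette width per candidate k, so the candidates can simply be ranked. The sketch below re-enters the values printed in the `mat` output (rows 2 to 11) by hand; the names `avg_sil`, `ranked`, and `best_k` are illustrative.

```r
# Mean silhouette widths for k = 2..11, copied from the mat output above
avg_sil <- c(0.2433222, 0.1675486, 0.1997315, 0.2116534, 0.2400086,
             0.2251367, 0.2199859, 0.2249569, 0.2347122, 0.2304451)
names(avg_sil) <- 2:11

# Rank the candidate k values from the highest to the lowest mean width
ranked <- sort(avg_sil, decreasing = TRUE)
head(ranked, 3)

best_k <- as.integer(names(ranked)[1])
best_k
```

On these numbers, the mean silhouette is highest at k = 2 and k = 6 is a close second, so choosing k = 6 here also reflects the prior expectation of about five trauma types rather than the silhouette criterion alone.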
Finally, let's use k-means++ with k = 6 and set the algorithm's maximal number of iterations before rerunning the experiment:

set.seed(1234)
clust_kpp = kmeans(tr_z, kpp_init(tr_z, 6), iter.max=100, algorithm='Lloyd')
sil = silhouette(clust_kpp$cluster, dis_tra)
summary(sil)

## Silhouette of 1000 units in 6 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis_tra) :
##  Cluster sizes and average silhouette widths:
##       422       100       178        85        15       200
## 0.2166778 0.3353976 0.1898492 0.2478090 0.2294502 0.2836607
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.03672 0.19730 0.23080 0.24000 0.27710 0.40650

# plot(sil)
# report the overall mean silhouette value
mean(sil[, "sil_width"])

## [1] 0.2400086


13.6.5  Practice  Problem:  Youth  Development 


Use the Boys Town Study of Youth Development data (the second case study), CaseStudy02_Boystown_Data.csv, which we used in Chap. 7, to find clusters using variables like GPA, alcohol abuse, attitudes on drinking, social status, parent closeness, and delinquency (all variables other than gender and ID).

First, we must load the data and convert sex, dadjob, and momjob into dummy variables.


boystown <- read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex <- boystown$sex-1
boystown$dadjob <- (-1)*(boystown$dadjob-2)
boystown$momjob <- (-1)*(boystown$momjob-2)
str(boystown)


## 'data.frame':	200 obs. of  11 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : int  5 0 3 2 3 3 1 5 1 3 ...
##  $ Alcoholuse: int  2 4 2 2 6 3 2 6 5 2 ...
##  $ alcatt    : int  3 2 3 1 2 0 0 3 0 1 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : int  1 3 2 1 2 1 3 6 3 1 ...
##  $ momclose  : int  1 4 2 2 1 2 1 2 3 2 ...
##  $ larceny   : int  1 0 0 3 1 0 0 0 1 1 ...
##  $ vandalism : int  3 0 2 2 2 0 5 1 4 0 ...
Fig. 13.13 Main features discriminating between the 3 cohorts in the divorce impact on youth study


Then, extract all the variables, except the first two columns (subject identifiers and genders).


boystown_sub <- boystown[, -c(1, 2)]

Next, we need to standardize and cluster the data with k = 3. You may obtain the following centers (numbers could be a little different) (Fig. 13.13).


##          gpa  Alcoholuse      alcatt     dadjob      momjob   dadclose
## 1 -0.5101243 -0.08555163 -0.30098866  0.1939577  0.04868109  1.1914502
## 2 -0.2753631  0.49998217  0.13804858 -0.2421906 -0.30151766 -0.4521484
## 3  0.6590193 -0.51256447  0.04599325  0.1451756  0.31107377 -0.2896562
##      momclose    larceny  vandalism
## 1  0.65647213 -0.1755012 -0.4453044
## 2 -0.33341358 -0.4017282  0.5252308
## 3 -0.06343891  0.5769583 -0.2981561


Add the k-means cluster labels as a new (last) column back in the original dataset. To investigate the gender distribution within different clusters, we may use aggregate().


# Compute the averages for the variable 'sex', grouped by cluster
aggregate(data=boystown, sex~clusters, mean)


##  clusters  sex 
##  1  1  0.6875000 
##  2  2  0.5802469 
##  3  3  0.6760563 


Here  clusters  is  the  new  vector  indicating  cluster  labels.  The  gender  distri¬ 
bution  does  not  vary  much  between  different  cluster  labels  (Fig.  13.14). 
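
The label-then-aggregate step can be sketched end to end on made-up data. Everything below (the `toy` data frame, its columns, and `nstart = 10`) is hypothetical and only mirrors the structure of the practice problem: cluster on the standardized features, append the labels as the last column, then average `sex` within each cluster.

```r
set.seed(1234)
# Hypothetical stand-in for the Boys Town data: a binary sex indicator and two features
toy <- data.frame(sex = rbinom(60, 1, 0.6), gpa = rnorm(60), alcatt = rnorm(60))

# Cluster on the standardized features only (sex is excluded, as in the text)
toy_z <- as.data.frame(lapply(toy[, -1], scale))
fit   <- kmeans(toy_z, centers = 3, nstart = 10)

# Append the cluster labels as the new last column
toy$clusters <- fit$cluster

# Mean of the 0/1 indicator = proportion coded 1 in each cluster
res <- aggregate(sex ~ clusters, data = toy, FUN = mean)
res
```

Because `sex` is coded 0/1, each group mean is the proportion of observations coded 1 in that cluster.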



Fig.  13.14  Live  demo:  k-means  point  clustering 


This k-means live demo shows point clustering (Applies Multiclass AdaBoost.M1, SAMME and Bagging algorithm): http://olalonde.github.com/kmeans.js.

13.7  Hierarchical  Clustering 

There are a number of R hierarchical clustering functions, including:

• hclust in base R.

• agnes in the cluster package.

Alternative linkage criteria (between-cluster distance measures) can be used in all hierarchical clustering methods, e.g., single, complete, and Ward.

We will demonstrate hierarchical clustering using case-study 1 (Divorce and Consequences on Young Adults). We pre-set k = 3; notice that we have to use normalized data for hierarchical clustering.

require(cluster)

pitch_sing = agnes(di_z, diss=FALSE, method='single')
pitch_comp = agnes(di_z, diss=FALSE, method='complete')
pitch_ward = agnes(di_z, diss=FALSE, method='ward')
sil_sing = silhouette(cutree(pitch_sing, k=3), dis)
sil_comp = silhouette(cutree(pitch_comp, k=3), dis)

# try 10 clusters, see plot above
sil_ward = silhouette(cutree(pitch_ward, k=10), dis)
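
The same workflow can be tried with base R's hclust(), which needs no extra package. The sketch below uses synthetic points (the `toy` matrix is made up) and assumes that hclust's "ward.D2" method is the closest base-R counterpart of agnes(method = 'ward'); it compares the cluster sizes that three linkages produce when the tree is cut at k = 3.

```r
set.seed(1234)
# Synthetic data: three loose groups of 15 points in 2D
toy <- rbind(matrix(rnorm(30, 0), ncol = 2),
             matrix(rnorm(30, 4), ncol = 2),
             matrix(rnorm(30, 8), ncol = 2))

# Normalize first, then compute pairwise Euclidean distances
d <- dist(scale(toy))

# Fit one tree per linkage and cut each tree into k = 3 groups
linkages <- c("single", "complete", "ward.D2")
labels   <- sapply(linkages, function(m) cutree(hclust(d, method = m), k = 3))

# Cluster sizes per linkage
apply(labels, 2, table)
```

Single linkage tends to chain points together and can return very unbalanced groups, while complete and Ward linkage usually give more compact clusters.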

You  can  generate  the  hierarchical  plot  by  ggdendrogram  in  the  package 
ggdendro  (Figs.  13.15  and  13.16). 
Fig. 13.15 Hierarchical clustering using the Ward method


Fig. 13.16 Ten-level hierarchical clustering using the Ward method


# install.packages("ggdendro")
require(ggdendro)

ggdendrogram(as.dendrogram(pitch_ward), leaf_labels=FALSE, labels=FALSE)


Fig. 13.17 Silhouette plot for hierarchical clustering using the Ward method (n = 47, 10 clusters, average silhouette width: 0.24)


mean(sil_ward[, "sil_width"])

## [1] 0.2398738

ggdendrogram(as.dendrogram(pitch_ward), leaf_labels=TRUE, labels=T, size=10)

Generally speaking, the best result should come from Ward linkage, but you should also try complete linkage (method = 'complete'). We can see that the hierarchical clustering result (average silhouette value ~0.24) mostly agrees with the prior k-means (0.2) and k-means++ (0.2) results (Fig. 13.17).

summary(sil_ward)

## Silhouette of 47 units in 10 clusters from silhouette.default(x = cutree(pitch_ward, k = 10), dist = dis) :
##  Cluster sizes and average silhouette widths:
##           4           5           6           3           6          12
##  0.25905454  0.29195989  0.29305926 -0.02079056  0.19263836  0.26268274
##           5           2           3           1
##  0.32594365  0.44074717  0.08760990  0.00000000
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.1477  0.1231  0.2577  0.2399  0.3524  0.5176

plot(sil_ward)
13.8 Gaussian Mixture Models

More details about Gaussian mixture models (GMM) are provided in the supporting materials online. Below is a brief introduction to GMM using the Mclust function in the R package mclust.

For multivariate mixtures, there are 14 possible models in total:

• "EII" = spherical, equal volume

•  "VII"  =  spherical,  unequal  volume 

•  "EEI"  =  diagonal,  equal  volume  and  shape 

•  "VEI"  =  diagonal,  varying  volume,  equal  shape 

•  "EVI"  =  diagonal,  equal  volume,  varying  shape 

•  "VVI"  =  diagonal,  varying  volume  and  shape 

•  "EEE"  =  ellipsoidal,  equal  volume,  shape,  and  orientation 

•  "EVE"  =  ellipsoidal,  equal  volume  and  orientation  (*) 

•  "VEE"  =  ellipsoidal,  equal  shape  and  orientation  (*) 

•  "VVE"  =  ellipsoidal,  equal  orientation  (*) 

•  "EEV"  =  ellipsoidal,  equal  volume  and  equal  shape 

•  "VEV"  =  ellipsoidal,  equal  shape 

•  "EVV"  =  ellipsoidal,  equal  volume  (*) 

•  "VVV"  =  ellipsoidal,  varying  volume,  shape,  and  orientation 

For  more  practical  details,  you  may  refer  to  Mclust.  For  more  theoretical  details, 
see  C.  Fraley  and  A.  E.  Raftery  (2002). 

Let’s  use  the  Divorce  and  Consequences  on  Young  Adults  dataset  for  a 
demonstration. 


library(mclust)

set.seed(1234)
gmm_clust = Mclust(di_z)
gmm_clust$modelName

## [1] "EEE"

Thus, the optimal model here is "EEE" (Figs. 13.18, 13.19, and 13.20).

plot(gmm_clust$BIC, legendArgs = list(x = "bottom", ncol = 2, cex = 1))
plot(gmm_clust, what = "density")
plot(gmm_clust, what = "classification")


Fig. 13.18 Bayesian information criterion plots for different GMM classification models for the divorce youth data

Fig. 13.19 Pairs plot of the GMM clustering density

Fig. 13.20 Pairs plot of the GMM classification results

13.9 Summary


• k-means clustering may be most appropriate for exploratory data analytics. It is highly flexible and fairly efficient in terms of tessellating data into groups.

• It can be used for data that has no a priori classes (labels).

• Generated clusters may lead to phenotype stratification and/or be compared against known clinical traits.

Try  to  use  these  techniques  with  other  data  from  the  list  of  our  Case-Studies. 


13.10  Assignments:  13.  k-Means  Clustering 

Use the Amyotrophic Lateral Sclerosis (ALS) dataset. This case-study examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS). A major clinically relevant question in this biomedical study is: What patient phenotypes can be automatically and reliably identified and used to predict the change of the ALSFRS slope over time? This problem aims to explore the dataset by unsupervised learning.

• Load and prepare the data.

• Perform summary and preliminary visualization.

• Train a k-means model on the data, select k as we mentioned in Chap. 13.

•  Evaluate  the  model  performance  and  report  the  center  of  clusters  and  silhouette 
plots.  Explain  details  (Note:  Since  we  have  100  dimensions,  it  may  be  difficult  to 
use  bar  plots,  so  show  the  centers  only). 

•  Tune  parameters  and  plot  with  k-means++. 

•  Rerun  the  model  with  optimal  parameters  and  interpret  the  clustering  results. 

•  Apply  Hierarchical  Clustering  on  three  different  linkages  and  compare  the 
corresponding  Silhouette  plots. 

•  Fit  a  Gaussian  mixture  model,  select  the  optimal  model  and  draw  BIC  and 
Silhouette  plots.  (Hint,  you  need  to  sample  part  of  data  or  it  could  be  very  time 
consuming). 

•  Compare  the  result  of  the  above  methods. 
Chapter 14

Model Performance Assessment

In  previous  chapters,  we  used  prediction  accuracy  to  evaluate  classification  models. 
However,  having  accurate  predictions  in  one  dataset  does  not  necessarily  imply  that 
the  model  is  perfect  or  that  it  will  reproduce  when  tested  on  external  data.  We  need 
additional  metrics  to  evaluate  the  model  performance  and  to  make  sure  it  is  robust, 
reproducible,  reliable,  and  unbiased. 

In  this  chapter,  we  will  discuss  (1)  various  evaluation  strategies  for  prediction, 
clustering,  classification,  regression,  and  decision  trees;  (2)  visualization  of  ROC 
curves  and  performance  tradeoffs;  and  (3)  estimation  of  future  performance,  internal 
statistical  cross-validation  and  bootstrap  sampling. 


14.1  Measuring  the  Performance  of  Classification  Methods 

As mentioned previously, classification model performance cannot be evaluated by prediction accuracy alone. We build different classification models for different purposes. For example, in newborn screening for genetic defects, we want the model to have as few false negatives as possible. We don't want to classify anyone as "no defect" when they actually have a defective gene, since early treatment might alter the destiny of this newborn.

We  can  use  the  following  three  types  of  data  to  evaluate  the  performance  of  a 
classifier  model. 

•  Actual  class  values  (for  supervised  classification). 

•  Predicted  class  values. 

•  Estimated  probability  of  the  prediction. 

We  are  familiar  with  the  first  two  cases.  The  last  type  of  validation  relies  on  the 
predict(model,  test_data)  function  that  we  have  talked  about  in  previous  classifica¬ 
tion  and  prediction  chapters  (Chaps.  7,  8,  and  9).  Let’s  revisit  the  model  and  test 
data  we  discussed  in  Chap.  8;  the  Inpatient  Head  and  Neck  Cancer  Medication  data. 


© Ivo D. Dinov 2018

I. D. Dinov, Data Science and Predictive Analytics,
https://doi.org/10.1007/978-3-319-72347-1_14
We will demonstrate prediction probability estimation using this case-study, CaseStudy14_HeadNeck_Cancer_Medication.csv:

pred_raw <- predict(hn_classifier, hn_test, type="raw")
head(pred_raw)

##      early_stage later_stage
## [1,]   0.9381891  0.06181090
## [2,]   0.9381891  0.06181090
## [3,]   0.8715581  0.12844188
## [4,]   0.9382140  0.06178601
## [5,]   0.9675997  0.03240026
## [6,]   0.9675997  0.03240026

The above output includes the prediction probabilities for the first 6 rows of the data. This example is based on the Naive Bayes classifier; however, the same approach works for any other machine-learning classification or prediction technique.
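
To connect the two kinds of output, the sketch below turns a matrix of class probabilities into labels. The six rows are copied from the output above; the 0.10 cutoff is an arbitrary illustrative value, and the names `labels_max` and `labels_cut` are not from the text.

```r
# Posterior probabilities copied from head(pred_raw) above
pred_raw <- matrix(c(0.9381891, 0.06181090,
                     0.9381891, 0.06181090,
                     0.8715581, 0.12844188,
                     0.9382140, 0.06178601,
                     0.9675997, 0.03240026,
                     0.9675997, 0.03240026),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("early_stage", "later_stage")))

# Label each row with the class of maximal probability
labels_max <- factor(colnames(pred_raw)[max.col(pred_raw, ties.method = "first")],
                     levels = colnames(pred_raw))

# Alternative: flag later_stage whenever its probability exceeds a custom cutoff
labels_cut <- factor(ifelse(pred_raw[, "later_stage"] > 0.10,
                            "later_stage", "early_stage"),
                     levels = colnames(pred_raw))

table(labels_max)
table(labels_cut)
```

Lowering the cutoff for the rarer class trades more false positives for fewer false negatives, which is exactly the kind of tradeoff the probability output makes visible.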

In addition, we can contrast the predicted probabilities with the class labels output by the Naive Bayesian decision-support system:


hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)
pred_nb <- predict(hn_classifier, hn_test)
head(pred_nb)

## [1] early_stage early_stage early_stage early_stage early_stage early_stage
## Levels: early_stage later_stage

The generic predict() method automatically dispatches to the specific predict.naiveBayes(object, newdata, type = c("class", "raw"), threshold = 0.001, ...) call, where type = "raw" and type = "class" specify the output as the conditional a-posteriori probabilities for each class or the class with maximal probability, respectively. Back in Chap. 9, we discussed the C5.0 and the randomForest classifiers used to predict the chronic disease score in a (different) Quality of Life Study.

Below are the (probability) results of the C5.0 classification prediction:


pred_prob <- predict(qol_model, qol_test, type="prob")
head(pred_prob)

##    minor_disease severe_disease
## 10     0.1979698      0.8020302
## 12     0.1979698      0.8020302
## 26     0.3468705      0.6531295
## 37     0.1263975      0.8736025
## 41     0.7290209      0.2709791
## 43     0.3163673      0.6836327




These can be contrasted against the C5.0 classification label results:

pred_tree <- predict(qol_model, qol_test)
head(pred_tree)

## [1] severe_disease severe_disease severe_disease severe_disease
## [5] minor_disease  severe_disease
## Levels: minor_disease severe_disease

The same complementary types of outputs can be reported for most machine-learning classification and prediction approaches.


14.2  Evaluation  Strategies 

In  Chap.  7,  we  saw  an  attempt  to  categorize  the  supervised  classification  and 
unsupervised  clustering  methods.  Similarly,  Table  14.1  summarizes  the  basic 
types  of  evaluation  and  validation  strategies  for  different  forecasting,  prediction, 
ensembling,  and  clustering  techniques.  (Internal)  Statistical  Cross  Validation  or 
external  validation  should  always  be  applied  to  ensure  reliability  and  reproducibility 
of  the  results.  The  SciKit  clustering  performance  evaluation  and  Classification 
metrics  page  provide  details  about  many  alternative  techniques  and  metrics  for 
performance  evaluation  of  clustering  and  classification  methods. 


14.2.1  Binary  Outcomes 

More details about binary test assessment are available on the Scientific Methods for Health Sciences (SMHS) EBook site. Table 14.2 summarizes the key measures commonly used to evaluate the performance of binary tests, classifiers, or predictions.

See also SMHS EBook; Power, Sensitivity and Specificity section.


Table 14.1 Categories of clustering validation and classification evaluation strategies

Inference | Outcome | Evaluation metrics | Example R functions
Classification & Prediction | Binary | Accuracy, Sensitivity/Recall, Specificity, PPV/Precision, NPV, LOR | caret::confusionMatrix, gmodels::CrossTable, cluster::silhouette
Classification & Prediction | Categorical | Accuracy, Sensitivity/Specificity, PPV, NPV, LOR, Silhouette Coefficient | caret::confusionMatrix, gmodels::CrossTable, cluster::silhouette
Regression Modeling | Real Quantitative | Correlation coefficient, R², RMSE, Mutual Information, Homogeneity and Completeness Scores | cor, metrics::mse


Table 14.2 Evaluation of binary (dichotomous) statistical tests, classification methods, or forecasting predictions

Test result (prediction or classification label) | Actual condition absent (H0 is true) | Actual condition present (H1 is true) | Test interpretation
Negative (fail to reject H0) | TN: condition absent + negative result = true (accurate) negative | FN: condition present + negative result = false (invalid) negative; Type II error (proportional to β) | NPV = TN / (TN + FN)
Positive (reject H0) | FP: condition absent + positive result = false positive; Type I error (α) | TP: condition present + positive result = true positive | PPV = Precision = TP / (TP + FP)
Test interpretation | Specificity = TN / (TN + FP) | Power = Sensitivity = TP / (TP + FN) = 1 − β = 1 − FN / (FN + TP) | LOR = ln((S1/F1) / (S2/F2)), where S = success and F = failure for 2 binary variables, 1 and 2


Table 14.3 Cross-table

      | Predict T | Predict F
TRUE  | TP        | TN
FALSE | FP        | FN


14.2.2  Confusion  Matrices 

We discussed confusion matrices in Chap. 9. For binary classes, these will be 2×2 matrices. Each of the cells has a specific meaning, see the 2×2 Table 14.2, where

• True Positive (TP): Number of observations that are correctly classified as "yes" or "success"

• True Negative (TN): Number of observations that are correctly classified as "no" or "failure"

• False Positive (FP): Number of observations that are incorrectly classified as "yes" or "success"

• False Negative (FN): Number of observations that are incorrectly classified as "no" or "failure"
Using Confusion Matrices to Measure Performance

The way we calculate accuracy using these four cells is summarized by the following formula:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{\text{Total number of observations}}.$$

On the other hand, the error rate, or proportion of incorrectly classified observations, is calculated using:

$$\text{error rate} = \frac{FP + FN}{TP + TN + FP + FN} = \frac{FP + FN}{\text{Total number of observations}} = 1 - \text{accuracy}.$$

If  we  look  at  the  numerator  and  denominator  carefully,  we  can  see  that  the  error 
rate  and  accuracy  add  up  to  1.  Therefore,  95%  accuracy  implies  a  5%  error  rate. 

In R, we have multiple ways to obtain confusion matrices. The simplest way would be to use table(). For example, in Chap. 8, to report a plain 2×2 table we used:


hn_test_pred <- predict(hn_classifier, hn_test)
table(hn_test_pred, hn_med_test$stage)

##
## hn_test_pred  early_stage later_stage
##   early_stage          69          23
##   later_stage           8           0

Then why did we use the CrossTable() function back in Chap. 8? Because it reports additional useful information about the model performance.

library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)

##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  100
##
##              | hn_med_test$stage
## hn_test_pred | early_stage | later_stage |   Row Total |
## -------------|-------------|-------------|-------------|
##  early_stage |          69 |          23 |          92 |
##              |       0.048 |       0.160 |             |
##              |       0.750 |       0.250 |       0.920 |
##              |       0.896 |       1.000 |             |
##              |       0.690 |       0.230 |             |
## -------------|-------------|-------------|-------------|
##  later_stage |           8 |           0 |           8 |
##              |       0.550 |       1.840 |             |
##              |       1.000 |       0.000 |       0.080 |
##              |       0.104 |       0.000 |             |
##              |       0.080 |       0.000 |             |
## -------------|-------------|-------------|-------------|
## Column Total |          77 |          23 |         100 |
##              |       0.770 |       0.230 |             |
## -------------|-------------|-------------|-------------|

With  both  tables,  we  can  calculate  accuracy  and  error  rate  by  hand. 


accuracy <- (69+0)/100
accuracy

## [1] 0.69

error_rate <- (23+8)/100
error_rate

## [1] 0.31

1-accuracy

## [1] 0.31


For  matrices  larger  than  2x2,  all  diagonal  elements  are  observations  that  have 
been  correctly  classified  and  off-diagonal  elements  are  those  that  have  been  incor¬ 
rectly  classified. 
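
The diagonal rule generalizes the hand calculation to any number of classes. Here is a minimal sketch on a made-up 3×3 confusion matrix (the counts and class names "a", "b", "c" are hypothetical).

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual
tab <- matrix(c(50,  3,  2,
                 4, 40,  6,
                 1,  5, 39),
              nrow = 3, byrow = TRUE,
              dimnames = list(predicted = c("a", "b", "c"),
                              actual    = c("a", "b", "c")))

# Correct classifications sit on the diagonal
accuracy   <- sum(diag(tab)) / sum(tab)

# Everything off the diagonal is an error
error_rate <- 1 - accuracy

accuracy
error_rate
```

The same two lines reproduce the 2×2 case: for the head and neck cancer table, the diagonal holds 69 and 0, which gives (69 + 0)/100.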


14.2.3  Other  Measures  of  Performance  Beyond  Accuracy 

So far, we discussed two performance methods: table and cross-table. A third function is confusionMatrix(), which provides the easiest way to report model performance. Notice that, when called with two vectors, the first argument is the vector of predicted labels and the second argument, of the same length, is the vector of actual (reference) labels, i.e., Test_Y. Below, we pass the equivalent table of predicted (rows) by actual (columns) labels.

This example was presented as the first case-study in Chap. 9.
library(caret)
qol_pred <- predict(qol_model, qol_test)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")

## Confusion Matrix and Statistics
##
##
## qol_pred         minor_disease severe_disease
##   minor_disease            149             89
##   severe_disease            74            131
##
##                Accuracy : 0.6321
##                  95% CI : (0.5853, 0.6771)
##     No Information Rate : 0.5034
##     P-Value [Acc > NIR] : 3.317e-08
##
##                   Kappa : 0.2637
##  Mcnemar's Test P-Value : 0.2728
##
##             Sensitivity : 0.5955
##             Specificity : 0.6682
##          Pos Pred Value : 0.6390
##          Neg Pred Value : 0.6261
##              Prevalence : 0.4966
##          Detection Rate : 0.2957
##    Detection Prevalence : 0.4628
##       Balanced Accuracy : 0.6318
##
##        'Positive' Class : severe_disease
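
Several of the reported statistics follow directly from the four cells of the table. The sketch below re-enters the counts from the output above and recomputes them with the formulas of Table 14.2, treating severe_disease as the positive class; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above: rows = predicted, columns = actual
tab <- matrix(c(149,  89,
                 74, 131),
              nrow = 2, byrow = TRUE,
              dimnames = list(predicted = c("minor_disease", "severe_disease"),
                              actual    = c("minor_disease", "severe_disease")))
pos <- "severe_disease"
neg <- "minor_disease"

TP <- tab[pos, pos]   # predicted positive, actually positive
FN <- tab[neg, pos]   # predicted negative, actually positive
FP <- tab[pos, neg]   # predicted positive, actually negative
TN <- tab[neg, neg]   # predicted negative, actually negative

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)
balanced    <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, balanced_accuracy = balanced), 4)
```

Changing the `positive` class swaps sensitivity with specificity and PPV with NPV, which is why it should always be stated alongside these measures.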


14.2.4  The  Kappa  (k)  Statistic 

The  Kappa  statistic  was  originally  developed  to  measure  the  reliability  between  two 
human  raters.  It  can  be  harnessed  in  machine-learning  applications  to  compare  the 
accuracy  of  a  classifier,  where  one  rater  represents  the  ground  truth  (for  labeled 
data,  these  are  the  actual  values  of  each  instance)  and  the  second  rater  represents 
the  results  of  the  automated  machine-learning  classifier.  The  order  of  listing  the 
raters  is  irrelevant. 

The Kappa statistic accounts for the possibility of a correct prediction by chance alone and answers the question: How much better is the agreement (between the ground truth and the machine-learning prediction) than would be expected by chance alone? Its value is at most 1 and typically lies between 0 and 1; negative values indicate agreement worse than chance. When κ = 1, we have perfect agreement between the computed prediction (typically the result of a model-based or model-free technique forecasting an outcome of interest) and the ground truth, whereas κ = 0 indicates agreement no better than an expected prediction (typically random, by-chance prediction). A common interpretation of the Kappa statistic includes:

•  Poor  agreement:  less  than  0.20 

•  Fair  agreement:  0.20-0.40 

•  Moderate  agreement:  0.40-0.60 

•  Good  agreement:  0.60-0.80 

•  Very  good  agreement:  0.80-1 

In the above confusionMatrix output, we have a fair agreement (κ = 0.2637). For
different problems, we may have different interpretations of the Kappa statistic.
To understand the Kappa statistic better, let’s look at its definition:

$$\kappa = \frac{P(a) - P(e)}{1 - P(e)}.$$

P(a) and P(e) denote the probabilities of actual and expected agreement
between the classifier and the true values, respectively.


table(qol_pred, qol_test$cd)

##
## qol_pred         minor_disease severe_disease
##   minor_disease            149             89
##   severe_disease            74            131

According to the above table, the actual agreement is the accuracy:

p_a <- (149+131)/(149+89+74+131)
p_a

## [1] 0.6320542

The manually and automatically computed accuracies coincide (0.6321). It may
be trickier to obtain the expected agreement. If the classifier and the true labels
were independent, the probability that both assign a case to a given class would be
the product of the two marginal probabilities for that class. Thus, we have:

$$P(\text{expected agreement for minor\_disease}) = P(\text{actual type is minor\_disease}) \times P(\text{predicted type is minor\_disease}).$$

Similarly:

$$P(\text{expected agreement for severe\_disease}) = P(\text{actual type is severe\_disease}) \times P(\text{predicted type is severe\_disease}).$$

Agreement on minor_disease and agreement on severe_disease are disjoint events, and
probability rules tell us that the probability of the union of two disjoint events
equals the sum of their individual probabilities. Hence, P(e) is the sum of these
two products. In our case:

p_e_minor <- ((149+74)/(149+89+74+131)) * ((149+89)/(149+89+74+131))
p_e_severe <- ((131+74)/(149+89+74+131)) * ((89+131)/(149+89+74+131))
p_e <- p_e_minor + p_e_severe
p_e

## [1] 0.5002522

Plugging p_a and p_e into the formula, we get:

kappa <- (p_a-p_e)/(1-p_e)
kappa

## [1] 0.26
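The hand calculation above generalizes to any square confusion table. The following is a minimal base-R sketch; the helper name cohen_kappa is hypothetical (it is not part of caret or vcd), and it assumes that rows hold the predicted labels and columns hold the true labels in the same order.

```r
# Hypothetical helper: unweighted Cohen's kappa from a square confusion table
cohen_kappa <- function(tab) {
  tab <- as.matrix(tab)
  stopifnot(nrow(tab) == ncol(tab))
  n   <- sum(tab)
  p_a <- sum(diag(tab)) / n                      # observed agreement (accuracy)
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_a - p_e) / (1 - p_e)
}

# QoL confusion table from the text (rows: predicted, columns: true)
qol_tab <- matrix(c(149, 74, 89, 131), nrow = 2,
                  dimnames = list(pred = c("minor_disease", "severe_disease"),
                                  true = c("minor_disease", "severe_disease")))
kappa_qol <- cohen_kappa(qol_tab)
round(kappa_qol, 4)   # 0.2637
```

Because the expected agreement is built from the row and column marginals, the same helper may be applied to tables with more than two classes.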

We get a similar value to the one in the confusionMatrix() output. A more
straightforward way of getting the Kappa statistic is by using the Kappa() function
in the vcd package.

# install.packages("vcd")
library(vcd)

## Loading required package: grid

Kappa(table(qol_pred, qol_test$cd))

##             value     ASE     z  Pr(>|z|)
## Unweighted 0.2637 0.04573 5.767 8.071e-09
## Weighted   0.2637 0.04573 5.767 8.071e-09

The combination of the Kappa() and table() functions yields a 2 x 4 matrix. The
Kappa statistic is reported as the unweighted value.

Generally speaking, predicting a severe disease outcome is a more critical
problem than predicting a mild disease state. Thus, weighted Kappa is also useful.
We give the severe disease a higher weight. The weighted Kappa test result is not
acceptable, since the classifier may make too many mistakes for the severe disease
cases: the weighted Kappa value is much lower than the unweighted one (0.068 in the
output below). Notice that Kappa is not restricted to the range [0,1]; in
particular, the weighted Kappa may be negative.

Kappa(table(qol_pred, qol_test$cd), weights = matrix(c(1,10,1,10), nrow=2))

##              value     ASE     z  Pr(>|z|)
## Unweighted 0.26374 0.04573 5.767 8.071e-09
## Weighted   0.06818 0.04009 1.701 8.898e-02

When the predicted value is the first argument of table(), the rows and columns
represent the predicted labels and the true labels, respectively.

table(qol_pred, qol_test$cd)

##
## qol_pred         minor_disease severe_disease
##   minor_disease            149             89
##   severe_disease            74            131
Summary of the Kappa Score for Calculating Prediction Accuracy

Kappa compares an Observed classification accuracy (output of our ML classifier)
with an Expected classification accuracy (corresponding to random chance
classification). It may be used to evaluate single classifiers and/or to compare
among a set of different classifiers. It takes into account random chance (agreement
with a random classifier). That makes Kappa more meaningful than simply using
accuracy as a metric. For instance, the interpretation of an Observed Accuracy of
80% is relative to the Expected Accuracy. An Observed Accuracy of 80% is more
impactful for an Expected Accuracy of 50% (κ = 0.6) than for an Expected Accuracy
of 75% (κ = 0.2).


14.2.5  Computation  of  Observed  Accuracy  and  Expected 
Accuracy 


Consider the following example of a classifier generating the following
confusion matrix. Columns represent the true labels and rows represent the
classifier-derived labels for this binary prediction example (Table 14.4).

Table 14.4 A simulated confusion matrix.

Class    True   False   Total
True       50      35      85
False      25      40      65
Total      75      75     150

In this example, there is a total of 150 observations (50 + 35 + 25 + 40). In reality,
75 are labeled as True (50 + 25) and another 75 are labeled as False (35 + 40). The
classifier labeled 85 as True (50 + 35) and the other 65 as False (25 + 40).

• Observed Accuracy (OA) is the proportion of instances that were
classified correctly throughout the entire confusion matrix:

$$OA = \frac{50 + 40}{150} = 0.6.$$

• Expected Accuracy (EA) is the accuracy that any random classifier would be
expected to achieve based on the given confusion matrix. EA depends on the number
of instances of each class (True and False) in the ground truth and on the number
of instances that the automated classifier assigned to each class. For each class,
the expected number of chance agreements is calculated by multiplying the marginal
frequencies of that class for the true-state and for the machine classified
instances, and dividing by the total number of instances. The marginal frequency of
True for the true-state is 75 (50 + 25) and for the corresponding ML classifier is
85 (50 + 35). Then, the expected agreement for the True outcome is:

$$EA(\text{True}) = \frac{75 \times 85}{150} = 42.5.$$

We similarly compute the EA(False) for the second, False, outcome, by using the
marginal frequencies for the true-state ((False | true state) = 75 = 35 + 40) and the
ML classifier ((False | classifier) = 65 = 25 + 40). Then, the expected agreement
for the False outcome is:

$$EA(\text{False}) = \frac{75 \times 65}{150} = 32.5.$$

Finally, the EA combines the True and False outcomes:

$$\text{Expected Accuracy } (EA) = \frac{42.5 + 32.5}{150} = 0.5.$$

Note that EA = 0.5 whenever the true-state binary classification is balanced
(in reality, the frequencies of True and False are equal, in our case 75).

The calculation of the kappa statistic relies on OA = 0.6 and EA = 0.5:

$$\kappa = \frac{OA - EA}{1 - EA} = \frac{0.6 - 0.5}{1 - 0.5} = 0.2.$$
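The arithmetic for Table 14.4 can be checked in a few lines of base R. This is only a sketch; the object names (sim_tab, EA_counts, kappa_sim) are illustrative and do not come from the text.

```r
# Table 14.4 (rows: classifier labels, columns: true labels)
sim_tab <- matrix(c(50, 25, 35, 40), nrow = 2,
                  dimnames = list(classifier = c("True", "False"),
                                  truth      = c("True", "False")))
n  <- sum(sim_tab)                                    # 150 observations
OA <- sum(diag(sim_tab)) / n                          # (50 + 40) / 150
EA_counts <- rowSums(sim_tab) * colSums(sim_tab) / n  # 42.5 (True), 32.5 (False)
EA <- sum(EA_counts) / n                              # (42.5 + 32.5) / 150
kappa_sim <- (OA - EA) / (1 - EA)
c(OA = OA, EA = EA, kappa = kappa_sim)                # 0.6, 0.5, 0.2
```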


14.2.6  Sensitivity  and  Specificity 


If we take a closer look at the confusionMatrix() output, we find there are two
important statistics, “sensitivity” and “specificity”.

Sensitivity, or true positive rate, measures the proportion of “success”
observations that are correctly classified.

$$\text{sensitivity} = \frac{TP}{TP + FN}.$$

Notice that TP + FN is the total number of true “success” observations.

On the other hand, specificity, or true negative rate, measures the proportion of
“failure” observations that are correctly classified.

$$\text{specificity} = \frac{TN}{TN + FP}.$$

Accordingly, TN + FP is the total number of true “failure” observations.
Using the table() output above and using "severe_disease" as “success”, we
can compute these two measures directly.

sens <- 131/(131+89)
sens

## [1] 0.5954545

spec <- 149/(149+74)
spec

## [1] 0.6681614

The caret package also provides functions to calculate sensitivity and
specificity.

library(caret)
sensitivity(qol_pred, qol_test$cd, positive="severe_disease")

## [1] 0.5954545

Sensitivity and specificity both range from 0 to 1. For either measure, a value of 1
implies that all observations of the corresponding class (positive for sensitivity,
negative for specificity) are correctly classified. However, simultaneously high
sensitivity and specificity may not be attainable in real world situations. There
is a tradeoff between sensitivity and specificity. To compromise, some studies
loosen the demands on one and focus on achieving high values on the other.
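Both measures depend on which class is treated as “success”. The following base-R sketch makes that choice explicit; the helper name sens_spec is hypothetical, and it assumes a 2 x 2 table with predicted labels in rows and true labels in columns.

```r
# Hypothetical helper: sensitivity and specificity from a 2x2 table
# (rows: predicted labels, columns: true labels)
sens_spec <- function(tab, positive) {
  negative <- setdiff(colnames(tab), positive)
  TP <- tab[positive, positive]; FN <- tab[negative, positive]
  TN <- tab[negative, negative]; FP <- tab[positive, negative]
  c(sensitivity = TP / (TP + FN), specificity = TN / (TN + FP))
}

qol_tab <- matrix(c(149, 74, 89, 131), nrow = 2,
                  dimnames = list(pred = c("minor_disease", "severe_disease"),
                                  true = c("minor_disease", "severe_disease")))
ss_severe <- sens_spec(qol_tab, positive = "severe_disease")
ss_minor  <- sens_spec(qol_tab, positive = "minor_disease")
round(rbind(severe = ss_severe, minor = ss_minor), 4)
```

Switching the positive class simply swaps the two values, which is why the positive argument should always be stated explicitly.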


14.2.7  Precision  and  Recall 


Very similar to sensitivity, precision measures the proportion of true “success”
observations among predicted “success” observations.

$$\text{precision} = \frac{TP}{TP + FP}.$$

Recall is the proportion of correctly identified “positives” among all truly
positive cases; it coincides with sensitivity. A model with high recall captures
most “interesting” cases.

$$\text{recall} = \frac{TP}{TP + FN}.$$

Again, let’s calculate these by hand for the QoL data:

prec <- 131/(131+74)
prec

## [1] 0.6390244

recall <- 131/(131+89)
recall

## [1] 0.5954545
Another way to obtain precision would be posPredValue() under the caret
package. Remember to specify which one is the “success” class.

posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
## [1] 0.6390244

From the definitions of precision and recall, we can derive the type 1 and
type 2 errors as follows:

$$error_1 = 1 - \text{Precision} = \frac{FP}{TP + FP}, \quad \text{and} \quad error_2 = 1 - \text{Recall} = \frac{FN}{TP + FN}.$$

Here, error_1 is the quantity also known as the false discovery rate; it differs
from the conventional type 1 error rate, 1 − specificity = FP/(FP + TN). For the
QoL data, we can compute the type 1 error (0.36) and type 2 error (0.40).

error1 <- 74/(131+74)
error2 <- 89/(131+89)
error1; error2

## [1] 0.3609756
## [1] 0.4045455


14.2.8 The F-Measure

The F-measure or F1-score combines precision and recall using the harmonic mean
assuming equal weights. A high F-score means high precision and high recall. This is
a convenient way of measuring model performance and comparing models.

$$F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{recall} + \text{precision}} = \frac{2 \times TP}{2 \times TP + FP + FN}.$$

Let’s calculate the F1-score by hand using the confusion matrix derived from the
Quality of Life prediction:

F1 <- (2*prec*recall)/(prec+recall); F1
## [1] 0.6164706

The direct calculation of the F1-statistic can be obtained using caret:

precision <- posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
recall <- sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
F1 <- (2 * precision * recall) / (precision + recall); F1

## [1] 0.6164706
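The two forms of the F-measure formula can be checked against each other using the QoL counts. The sketch below also includes a hypothetical helper, f_beta, for the weighted generalization (the F-beta score), which is not covered in the text and is shown only to illustrate what relaxing the equal-weights assumption looks like.

```r
TP <- 131; FP <- 74; FN <- 89            # QoL counts, "severe_disease" as success
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)

F1_harmonic <- 2 * precision * recall / (precision + recall)
F1_counts   <- 2 * TP / (2 * TP + FP + FN)

# Hypothetical generalization: beta > 1 emphasizes recall, beta < 1 precision
f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

c(F1_harmonic = F1_harmonic, F1_counts = F1_counts,
  F2 = f_beta(precision, recall, beta = 2))
```

With beta = 1 the helper reduces to the F1-score computed above.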
14.3 Visualizing Performance Tradeoffs (ROC Curve)

Another choice for evaluating classifier performance is by using graphs rather than
quantitative statistics. Graphs are usually more comprehensive than single statistics.

The R package ROCR provides user-friendly functions for visualizing model
performance. Details can be found on the ROCR website.

Here, we evaluate the model performance for the Quality of Life case study, see
Chap. 9.

# install.packages("ROCR")
library(ROCR)
pred <- ROCR::prediction(predictions=pred_prob[, 2], labels=qol_test$cd)
# avoid naming collision (ROCR::prediction), as
# there is another prediction function in neuralnet package.

pred_prob[, 2] is the probability of classifying each observation as
"severe_disease". The above code saved all the model prediction information into
the object pred.

The ROC (Receiver Operating Characteristic) curves are often used to examine
the tradeoff between detecting true positives and avoiding the false positives
(Fig. 14.1).


Fig. 14.1 Schematic of quantifying the efficacy of a classification method using the area under the
ROC curve
curve(log(x), from=0, to=100, xlab="False Positive Rate", ylab="True Positive Rate", main="ROC curve", col="green", lwd=3, axes=F)
Axis(side=1, at=c(0, 20, 40, 60, 80, 100), labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
Axis(side=2, at=0:5, labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
segments(0, 0, 110, 5, lty=2, lwd=3)
segments(0, 0, 0, 4.7, lty=2, lwd=3, col="blue")
segments(0, 4.7, 107, 4.7, lty=2, lwd=3, col="blue")
text(20, 4, col="blue", labels = "Perfect Classifier")
text(40, 3, col="green", labels = "Test Classifier")
text(70, 2, col="black", labels = "Classifier with no predictive value")

The blue line in the above graph represents the perfect classifier, where we have
0% false positives and 100% true positives. The middle green line is the test
classifier. Most of our classifiers trained on real data will look like this. The
black diagonal line illustrates a classifier with no predictive value. We can see
that it has the same true positive rate and false positive rate. Thus, it cannot
distinguish between the two classes.
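Each point on a ROC curve corresponds to one decision threshold applied to the predicted scores. The following base-R sketch computes these points by hand. It uses simulated scores (an assumption for illustration, not the QoL model), and the object names are hypothetical.

```r
set.seed(1234)
# Simulated data: 100 negatives (0) and 100 positives (1) with overlapping scores
truth  <- rep(c(0, 1), each = 100)
scores <- c(rnorm(100, mean = 0), rnorm(100, mean = 1))

# Sweep the decision threshold over the observed scores, from strict to lenient
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[truth == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[truth == 0] >= t))

roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
head(roc_points, 3)
```

Plotting tpr against fpr (e.g., plot(roc_points$fpr, roc_points$tpr, type = "s")) traces the empirical ROC curve; lowering the threshold can only move the curve up or to the right.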

In terms of identifying positive values, we want our ROC curve to be as close to
the perfect line as possible. Thus, we measure the area under the ROC curve
(abbreviated as AUC) to show how close our curve is to the perfect classifier. To
do this, we have to change the scale of the graph above. Mapping 100% to 1, we
have a 1 x 1 square. The area under the perfect classifier would be 1, and the area
under the classifier with no predictive value would be 0.5. Then, 1 and 0.5 will be
the upper and lower limits for our model ROC curve. We have the following scoring
system (numbers indicate area under the curve) for predictive model ROC curves:

• Outstanding: 0.9-1.0

• Excellent/good: 0.8-0.9

• Acceptable/fair: 0.7-0.8

• Poor: 0.6-0.7

• No discrimination: 0.5-0.6.

Note that this rating system is somewhat subjective. Let’s use the ROCR package
to draw a ROC curve.


roc <- performance(pred, measure="tpr", x.measure="fpr")

By specifying "tpr" (True positive rate) and "fpr" (False positive rate), we
made a “performance” object (Fig. 14.2).

plot(roc, main="ROC curve for Quality of Life model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

Fig. 14.2 ROC curve of the prediction of disease severity using the quality of life (QoL) data

The segments command draws the dashed line representing the classifier with no
predictive value.

To measure this quantitatively, we need to create a new performance object with
measure = "auc", or area under the curve.


roc_auc <- performance(pred, measure="auc")

Now the roc_auc is stored as an S4 object. This is quite different from data frames
and matrices. First, we can use the str() function to see its structure.

str(roc_auc)

## Formal class 'performance' [package "ROCR"] with 6 slots
##   ..@ x.name      : chr "None"
##   ..@ y.name      : chr "Area under the ROC curve"
##   ..@ alpha.name  : chr "none"
##   ..@ x.values    : list()
##   ..@ y.values    :List of 1
##   .. ..$ : num 0.65
##   ..@ alpha.values: list()

The roc_auc object has six slots. The AUC value is stored in y.values. To
extract it, we use the @ symbol, according to the output of the str() function.

roc_auc@y.values

## [[1]]
## [1] 0.6496739

Thus, the obtained AUC = 0.65, which suggests a poor classifier, according to the
above scoring schema.
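The AUC also has a direct probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base-R sketch below, again on simulated scores (an assumption, not the QoL predictions), computes the AUC both from this pairwise definition and as the trapezoidal area under the empirical ROC curve.

```r
set.seed(1234)
truth  <- rep(c(0, 1), each = 100)
scores <- c(rnorm(100, mean = 0), rnorm(100, mean = 1))
pos <- scores[truth == 1]
neg <- scores[truth == 0]

# AUC as the share of (positive, negative) pairs ranked correctly (ties count 1/2)
auc_rank <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# AUC as the trapezoidal area under the empirical ROC curve
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- c(0, sapply(thresholds, function(t) mean(pos >= t)))
fpr <- c(0, sapply(thresholds, function(t) mean(neg >= t)))
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

c(rank = auc_rank, trapezoid = auc_trap)
```

The two numbers agree, which is one way to sanity-check an AUC reported by a package.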
14.4 Estimating Future Performance (Internal Statistical Validation)

The evaluation methods we have talked about all measure the model error on a
single set of data. The re-substitution error, obtained by evaluating a model on
the same data used to build it, tends to be optimistic. A better alternative is
building the model on training data and measuring the model error on separate
testing data. This is one way of dealing with unseen data. First, let’s introduce
the basic ideas; more details will be presented in Chap. 21.


14.4.1  The  Holdout  Method 

The idea is to partition the entire dataset into two separate datasets, using one of
them to create the model and the other to test the model performance. In practice, we
usually use a fraction (e.g., 50%, or 2/3) of our data for training the model, and
reserve the rest (e.g., 50%, or 1/3) for testing. Note that the testing data may also
be further split into proportions for internal repeated (e.g., cross-validation)
testing and final external (independent) testing.

The partition has to be randomized. In R, a simple way of doing this is to create an
index vector of randomly drawn row numbers and use it to extract random rows from
the original dataset. In Chap. 11, we used this method to partition the Google
Trends data.

sub <- sample(nrow(google_norm), floor(nrow(google_norm)*0.75))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]

Another way of partitioning is by using createDataPartition() under the
caret package. Instead of using the entire original dataset, we can use the outcome
variable, google_norm$RealEstate, or any of the independent variables.

sub <- createDataPartition(google_norm$RealEstate, p=0.75, list = F)
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]

To make sure that the model can be applied to future datasets, we can partition
the original dataset into three separate subsets. In this way, we have two subsets
for testing. The additional validation dataset can reduce the chance that we
obtain a good model merely by chance (non-representative subsets). A common split
among training, test, and validation subsets would be 50%, 25%, and 25%,
respectively.

sub <- sample(nrow(google_norm), floor(nrow(google_norm)*0.50))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]

sub1 <- sample(nrow(google_test), floor(nrow(google_test)*0.5))
google_test1 <- google_test[sub1, ]
google_test2 <- google_test[-sub1, ]
nrow(google_norm)

## [1] 731

nrow(google_train)

## [1] 365

nrow(google_test1)

## [1] 183

nrow(google_test2)

## [1] 183
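The two-stage sampling above can also be expressed as a single random permutation cut into three pieces. The following is a minimal base-R sketch; the helper name three_way_split and its arguments are hypothetical, and it returns row indices rather than data frames.

```r
# Hypothetical helper: split row indices 1..n into three disjoint random subsets
three_way_split <- function(n, p = c(train = 0.5, test = 0.25, validation = 0.25)) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  idx  <- sample(n)                     # one random permutation of 1..n
  cuts <- floor(cumsum(p) * n)          # last index of each subset
  cuts[length(cuts)] <- n               # guard against rounding
  starts <- c(1, head(cuts, -1) + 1)
  parts <- Map(function(a, b) idx[a:b], starts, cuts)
  names(parts) <- names(p)
  parts
}

set.seed(1234)
parts <- three_way_split(731)           # 731 rows, as in google_norm
sapply(parts, length)                   # train 365, test 183, validation 183
```

The subsets would then be extracted with, e.g., google_norm[parts$train, ].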

However, when we only have a very small dataset, it’s difficult to split off too
much data, as this reduces the sample further. There are the following two options
for evaluation of model performance using (independent) unseen data:
cross-validation and bootstrap sampling. These are implemented in the caret package.


14.4.2  Cross-Validation 

For  complete  details  see  DSPA  Cross-Validation  (Chap.  21).  Below,  we  describe  the 
fundamentals  of  cross-validation  as  an  internal  statistical  validation  technique. 

This technique is known as k-fold cross-validation or k-fold CV, which is a
standard for estimating model performance. K-fold CV randomly divides the
original data into k separate random subsets called folds.

A common practice is to use k = 10, or 10-fold CV, to split the data into
10 different subsets, each time using one of the subsets as the test set and the
rest to build the model. createFolds() under the caret package will help us to do
so. set.seed() ensures the folds created are the same if you run the code
twice. 1234 is just an arbitrary number. You can use any number for set.seed().
We use the normalized Google Trends dataset in this section.


library("caret")
set.seed(1234)
folds <- createFolds(google_norm$RealEstate, k=10)
str(folds)

## List of 10
##  $ Fold01: int [1:73] 5 9 11 12 18 19 28 29 54 65 ...
##  $ Fold02: int [1:73] 14 24 35 49 52 61 63 76 99 115 ...
##  $ Fold03: int [1:73] 1 8 41 45 51 74 78 92 100 104 ...
##  $ Fold04: int [1:73] 30 32 37 40 43 57 59 64 70 96 ...
##  $ Fold05: int [1:73] 13 16 25 53 56 68 77 81 93 95 ...
##  $ Fold06: int [1:73] 4 6 15 20 36 69 71 73 79 89 ...
##  $ Fold07: int [1:73] 34 42 44 84 90 98 102 110 112 117 ...
##  $ Fold08: int [1:73] 2 3 48 62 82 85 86 87 88 91 ...
##  $ Fold09: int [1:74] 10 21 23 27 33 39 46 55 58 75 ...
##  $ Fold10: int [1:73] 7 17 22 26 31 38 47 50 60 66 ...
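For readers without caret, fold assignment can be sketched in base R. The helper name make_folds is hypothetical, and, unlike createFolds(), this simple version does not stratify on the outcome; it only shuffles fold labels across the row indices.

```r
# Hypothetical helper: assign each of n rows to one of k folds at random
make_folds <- function(n, k = 10) {
  fold_id <- sample(rep(seq_len(k), length.out = n))  # shuffled fold labels
  split(seq_len(n), fold_id)                          # list of test-row indices
}

set.seed(1234)
folds_base <- make_folds(731, k = 10)    # 731 rows, as in google_norm
sapply(folds_base, length)               # one fold of 74 rows, nine of 73
```

Each row index appears in exactly one fold, so every observation is used for testing exactly once.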


Another way to cross-validate is to use cv_partition() in the sparsediscrim
package.

# install.packages("sparsediscrim")
require(sparsediscrim)
folds2 = cv_partition(1:nrow(google_norm), num_folds=10)

And the structure of folds may be reported by:

str(folds2)

## List of 10
##  $ Fold1 :List of 2
##   ..$ training: int [1:657] 4 5 6 8 9 10 11 12 16 17 ...
##   ..$ test    : int [1:74] 287 3 596 1 722 351 623 257 568 414 ...
##  $ Fold2 :List of 2
##   ..$ training: int [1:658] 1 2 3 5 6 7 8 9 10 11 ...
##   ..$ test    : int [1:73] 611 416 52 203 359 195 452 258 614 121 ...
##  $ Fold3 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 7 8 9 10 11 ...
##   ..$ test    : int [1:73] 182 202 443 152 486 229 88 158 178 293 ...
##  $ Fold4 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 646 439 362 481 183 387 252 520 438 586 ...
##  $ Fold5 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 503 665 47 603 348 125 719 11 461 361 ...
##  $ Fold6 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 6 7 9 10 11 12 ...
##   ..$ test    : int [1:73] 666 411 159 21 565 298 537 262 131 600 ...
##  $ Fold7 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 269 572 410 488 124 447 313 255 360 473 ...
##  $ Fold8 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 11 ...
##   ..$ test    : int [1:73] 446 215 256 116 592 284 294 300 402 455 ...
##  $ Fold9 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 25 634 717 545 76 378 53 194 70 346 ...
##  $ Fold10:List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 10 11 ...
##   ..$ test    : int [1:73] 468 609 40 101 595 132 248 524 376 618 ...

Now, we have 10 different subsets in the folds object. We can use lapply()
to fit the model. 90% of the data will be used for training, so we use [-x, ] to
represent all observations not in a specific fold. In Chap. 11, we showed building
a neural network model for the Google Trends data. We can do the same for each fold
manually: train, test, aggregate the results, and report the agreement (correlations
between the predicted and observed RealEstate values).

library(neuralnet)
fold_cv <- lapply(folds, function(x){
  google_train <- google_norm[-x, ]
  google_test <- google_norm[x, ]
  google_model <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train)
  google_pred <- compute(google_model, google_test[, c(1:2, 4:8)])
  pred_results <- google_pred$net.result
  pred_cor <- cor(google_test$RealEstate, pred_results)
  return(pred_cor)
})
str(fold_cv)

## List of 10
##  $ Fold01: num [1, 1] 0.977
##  $ Fold02: num [1, 1] 0.97
##  $ Fold03: num [1, 1] 0.972
##  $ Fold04: num [1, 1] 0.979
##  $ Fold05: num [1, 1] 0.976
##  $ Fold06: num [1, 1] 0.974
##  $ Fold07: num [1, 1] 0.971
##  $ Fold08: num [1, 1] 0.982
##  $ Fold09: num [1, 1] -0.516
##  $ Fold10: num [1, 1] 0.974

From the output, we know that in most of the folds the model predicts very well.
In a typical run, one fold may yield bad results. We can use the mean of these
10 correlations to represent the overall model performance. But first, we need to use
the unlist() function to transform fold_cv into a vector.

mean(unlist(fold_cv))

## [1] 0.8258223801

This correlation is high, suggesting a strong association between predicted and true
values, even though the mean is pulled down by the single poorly predicted fold
(Fold09). Thus, the model is very good in terms of its prediction.
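The same train, test, and aggregate loop can be sketched without any add-on packages. The example below uses simulated regression data and a linear model (both are assumptions for illustration; it is not the Google Trends neural network), and it reports the median and minimum alongside the mean, because a single bad fold can dominate the average.

```r
set.seed(1234)
# Simulated regression data (illustrative only)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)

k       <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))   # random fold labels
cv_cor  <- sapply(seq_len(k), function(i) {
  train <- dat[fold_id != i, ]
  test  <- dat[fold_id == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)             # fit on the other k-1 folds
  cor(test$y, predict(fit, newdata = test))          # agreement on held-out fold
})

round(cv_cor, 3)
c(mean = mean(cv_cor), median = median(cv_cor), min = min(cv_cor))
```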


14.4.3  Bootstrap  Sampling 

The second method is called bootstrap sampling. In k-fold CV, each observation is
used for testing only once. However, bootstrap sampling is a sampling process with
replacement. Every observation is returned to the pool before the next draw, so the
same observation may be selected several times within one sample and may appear in
multiple samples.

A very special setting of bootstrap uses at each iteration 63.2% of the original data
as our training dataset and the remaining 36.8% as the test dataset. Thus, compared
to k-fold CV, bootstrap sampling is less representative of the full dataset. A special
case of bootstrapping, 0.632 bootstrap, addresses this issue by changing the final
performance metric using the following formula:

$$error = 0.632 \times error_{test} + 0.368 \times error_{train}.$$

This synthesizes the optimistic model performance on training data with the
pessimistic model performance on test data by weighting the corresponding errors.
This method is extremely good for small samples.
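A minimal base-R sketch of this weighting is shown below. It relies on several assumptions that are not in the text: simulated data, a linear model, mean squared error as the error measure, and out-of-bag rows (those not drawn into a bootstrap sample) playing the role of the test data. It is one simple variant, not a reference implementation of the 0.632 estimator.

```r
set.seed(1234)
n   <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

B <- 200                                        # number of bootstrap replicates
err_test <- replicate(B, {
  in_bag <- sample(n, replace = TRUE)           # bootstrap training rows
  oob    <- setdiff(seq_len(n), in_bag)         # out-of-bag rows act as test data
  fit    <- lm(y ~ x, data = dat[in_bag, ])
  mse(fit, dat[oob, ])
})
err_train <- mse(lm(y ~ x, data = dat), dat)    # re-substitution error

err_632 <- 0.632 * mean(err_test) + 0.368 * err_train
c(train = err_train, test = mean(err_test), boot632 = err_632)
```

By construction, the combined estimate lies between the training error and the average out-of-bag error.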

To see the rationale behind 0.632 bootstrap, consider a standard training set T of
cardinality n, where our bootstrap sampling generates m new training sets T_i, each
of size n'. Sampling from T uniformly with replacement suggests that some
observations may be repeated in each sample T_i. Suppose the size of the sub-samples
is of the same order as T, i.e., n' = n, then for large n the sample T_i is expected
to contain the fraction (1 − 1/e) ≈ 0.632 of the unique cases from the complete
original collection T, and the remaining proportion 0.368 is expected to consist of
repeated duplicates. Hence, the name 0.632 bootstrap sampling. In general, for
large n and n', the sample T_i is expected to have

$$n\left(1 - e^{-n'/n}\right)$$

unique cases (see On Estimating the Size and Confidence of a Statistical Audit).
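The 0.632 fraction is easy to confirm by simulation. The following base-R sketch draws repeated bootstrap samples and counts unique cases; the sample sizes and the number of repetitions are arbitrary choices made for illustration.

```r
set.seed(1234)
n <- 1000

# Case n' = n: fraction of unique cases in a bootstrap sample
frac_unique <- replicate(500, length(unique(sample(n, n, replace = TRUE))) / n)
c(simulated = mean(frac_unique),
  exact     = 1 - (1 - 1/n)^n,
  limit     = 1 - exp(-1))

# General case: n_prime draws with replacement from n cases
n_prime    <- 500
sim_unique <- mean(replicate(500,
                length(unique(sample(n, n_prime, replace = TRUE)))))
c(simulated = sim_unique, approx = n * (1 - exp(-n_prime / n)))
```

The exact expectation for n' = n is 1 − (1 − 1/n)^n, which converges to 1 − 1/e as n grows.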

Having  the  bootstrap  samples,  the  m  models  can  be  fitted  (estimated)  and 
aggregated,  e.g.,  by  averaging  the  outputs  (for  regression)  or  by  using  voting 
methods  (for  classification).  We  will  discuss  this  more  in  later  chapters. 

Try to apply the same techniques to some of the other data in the list of
Case-Studies.


14.5  Assignment:  14.  Evaluation  of  Model  Performance 

The Autism Brain Imaging Data Exchange (ABIDE) dataset includes imaging, clinical,
genetics and phenotypic data for over 1000 pediatric cases.

•  Apply  C5.0  to  predict  on  part  of  data  (training  data). 

• Evaluate the model’s performance, using confusion matrices, accuracy, κ,
precision, recall, F-measure, etc.

•  Explain  and  compare  each  evaluation. 

•  Use  the  ROC  to  examine  the  tradeoff  between  detecting  true  positives  and 
avoiding  the  false  positives  and  report  AUC. 

•  Finally,  apply  cross  validation  on  C5.0  and  report  the  CV  error. 

•  You  may  apply  the  same  analysis  workflow  to  evaluate  the  performance  of 
alternative  methods  (e.g.,  KNN,  SVM,  LDA,  QDA,  Neural  Networks,  etc.) 
Chapter 15

Improving  Model  Performance 



We already explored several alternative machine learning (ML) methods for
prediction, classification, clustering, and outcome forecasting. In many situations,
we derive models by estimating model coefficients or parameters. The main question
now is: How can we adopt the advantages of crowdsourcing and biosocial networking
to aggregate different predictive analytics strategies? Are there reasons to
believe that such ensembles of forecasting methods may actually improve the
performance (e.g., increase prediction accuracy) of the resulting consensus
meta-algorithm? In this chapter, we are going to introduce ways that we can search
for optimal parameters for a single ML method, as well as aggregate different methods
into ensembles to enhance their collective performance relative to any of the
individual methods that are part of the meta-aggregate.

After  we  summarize  the  core  methods,  we  will  present  automated  and  customized 
parameter  tuning,  and  show  strategies  for  improving  model  performance  based  on 
meta-learning  via  bagging  and  boosting. 


15.1 Improving Model Performance by Parameter Tuning

One of the methods for improving model performance relies on tuning, which is the
process of searching for the best parameters for a specific method. Table 15.1
summarizes the parameters for each method we covered in previous chapters.

Table 15.1 Synopsis of the basic prediction, classification and clustering methods and their core
parameters

Model                                          Learning task    Method                        Parameters
KNN                                            Classification   class::knn                    data, k
K-Means                                        Classification   stats::kmeans                 data, k
Naive Bayes                                    Classification   e1071::naiveBayes             train, class, laplace
Decision Trees                                 Classification   C50::C5.0                     train, class, trials, costs
OneR Rule Learner                              Classification   RWeka::OneR                   class~predictors, data
RIPPER Rule Learner                            Classification   RWeka::JRip                   formula, data, subset, na.action, control, options
Linear Regression                              Regression       stats::lm                     formula, data, subset, weights, na.action, method
Regression Trees                               Regression       rpart::rpart                  dep_var ~ indep_var, data
Model Trees                                    Regression       RWeka::M5P                    formula, data, subset, na.action, control
Neural Networks                                Dual use         nnet::nnet                    x, y, weights, size, Wts, mask, linout, entropy, softmax, censored, skip, rang, decay, maxit, Hess, trace, MaxNWts, abstol, reltol
Support Vector Machines (Polynomial Kernel)    Dual use         caret::train::svmLinear       C
Support Vector Machines (Radial Basis Kernel)  Dual use         caret::train::svmRadial       C, sigma
Support Vector Machines (general)              Dual use         kernlab::ksvm                 formula, data, kernel
Random Forests                                 Dual use         randomForest::randomForest    formula, data


15.2 Using caret for Automated Parameter Tuning

In Chap. 7, we used KNN and plugged in random k parameters for the number of
nearest neighbors. This time, we will test multiple k values simultaneously and pick
the one with the highest accuracy. When using the caret package, we need to specify a
class variable, a dataset containing a class variable, predicting features, and the
method we will be using. In Chap. 7, we used the Boys Town Study of Youth
Development dataset, normalized all the features, stored them in boystown_n, and
formulated the outcome class variable first (boystown$grade).
str(boystown_n)

## 'data.frame':    200 obs. of  10 variables:
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : num  1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
##  $ Alcoholuse: num  0.182 0.364 0.182 0.182 0.545 ...
##  $ alcatt    : num  0.5 0.333 0.5 0.167 0.333 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : num  0.143 0.429 0.286 0.143 0.286 ...
##  $ momclose  : num  0.143 0.571 0.286 0.286 0.143 ...
##  $ larceny   : num  0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
##  $ vandalism : num  0.429 0 0.286 0.286 0.286 ...

boystown_n<-cbind(boystown_n, boystown[, 11])
str(boystown_n)

## 'data.frame':    200 obs. of  11 variables:
##  $ sex           : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa           : num  1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
##  $ Alcoholuse    : num  0.182 0.364 0.182 0.182 0.545 ...
##  $ alcatt        : num  0.5 0.333 0.5 0.167 0.333 ...
##  $ dadjob        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob        : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose      : num  0.143 0.429 0.286 0.143 0.286 ...
##  $ momclose      : num  0.143 0.571 0.286 0.286 0.143 ...
##  $ larceny       : num  0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
##  $ vandalism     : num  0.429 0 0.286 0.286 0.286 ...
##  $ boystown[, 11]: Factor w/ 2 levels "above_avg","avg_or_below": 2 2 1 2 1 2 ...

colnames(boystown_n)[11]<-"grade"


The dataset including a specific class variable and predictive features is now successfully created. We are using the KNN method as an example with the class variable grade. So, we plug this information into the caret::train() function. Note that caret is using the full dataset because it will automatically do the random sampling for us. To make the results reproducible, we utilize the set.seed() function that we previously used, see Chap. 14.


library(caret)
set.seed(123)

m<-train(grade~., data=boystown_n, method="knn")
m; summary(m)

## k-Nearest Neighbors
##
## 200 samples
##  10 predictor
##   2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
##   k  Accuracy   Kappa
##   5  0.7952617  0.5193402
##   7  0.8143626  0.5585191
##   9  0.8070520  0.5348281
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

##             Length Class      Mode
## learn        2     -none-     list
## k            1     -none-     numeric
## theDots      0     -none-     list
## xNames      10     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list
## obsLevels    2     -none-     character


In this case, using summary(m) or str(m) to summarize the object m may report too much low-level information. Instead, we can simply type the object name m to get a more concise report, which includes:

1. Description of the dataset: number of samples, features, and classes.

2. Re-sampling process: here, we use 25 bootstrap samples with 200 observations (same size as the observed dataset) each to train the model.

3. Candidate models with different parameters that have been evaluated: by default, caret uses 3 different choices for each parameter (for binary parameters, it only allows two choices, TRUE and FALSE). As KNN has only one parameter k, we have three candidate models reported in the output above.

4. Optimal model: the model with the largest accuracy is the one corresponding to k=7.

Let's see how accurate this "optimal model" is in terms of the re-substitution error. Again, we will use the predict() function specifying the object m and the dataset boystown_n. Then, we can report the contingency table showing the agreement between the predictions and real class labels.


set.seed(1234)
p<-predict(m, boystown_n)
table(p, boystown_n$grade)

##
## p              above_avg avg_or_below
##   above_avg          132           17
##   avg_or_below         2           49

This model has (17 + 2)/200 = 0.095 re-substitution error (9.5%). This means that in the 200 observations that we used to train this model, 90.5% of them were correctly classified. Note that re-substitution error is different from the resampled accuracy. The resampled accuracy of this model is about 0.81, which is reported when we print the model object m. As mentioned in Chap. 14, we can obtain prediction probabilities for each observation in the original boystown_n dataset.
head(predict(m, boystown_n, type = "prob"))

##   above_avg avg_or_below
## 1 0.0000000    1.0000000
## 2 1.0000000    0.0000000
## 3 0.7142857    0.2857143
## 4 0.8571429    0.1428571
## 5 0.2857143    0.7142857
## 6 0.5714286    0.4285714
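The re-substitution error and the Kappa statistic can both be recomputed by hand from a contingency table. The following base-R sketch does this for the 2 x 2 table of predicted versus observed grades reported earlier; the variable names (tab, resub_error, cohen_kappa) are illustrative choices, and the Kappa obtained here is a re-substitution value, so it is not expected to match the resampled Kappa that caret prints.

```r
# Contingency table copied from the table(p, boystown_n$grade) output:
# rows = predicted class, columns = observed class
tab <- matrix(c(132, 2, 17, 49), nrow = 2,
              dimnames = list(predicted = c("above_avg", "avg_or_below"),
                              observed  = c("above_avg", "avg_or_below")))

n <- sum(tab)                              # 200 observations
resub_accuracy <- sum(diag(tab)) / n       # observed agreement
resub_error <- 1 - resub_accuracy          # (17 + 2) / 200

# Agreement expected by chance, from the row and column margins
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to at most 1
cohen_kappa <- (resub_accuracy - expected) / (1 - expected)

round(c(error = resub_error, accuracy = resub_accuracy, kappa = cohen_kappa), 3)
```

Because Kappa discounts the agreement that the class imbalance alone would produce, it is lower than the raw accuracy on the same table.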


15.2.1  Customizing  the  Tuning  Process 

The default setting of train() might not meet the specific needs of every study. In our case, the optimal k might be smaller than 5. The caret package allows us to customize the settings for train().

caret::trainControl() can help us customize the re-sampling method. Table 15.2 lists six popular re-sampling methods that we might want to use.

These methods help us find representative samples to train the model. Let's use the 0.632 bootstrap as an example. Just specify method="boot632" in the trainControl() function. The number of different samples to include can be customized by the number= option. Another option in trainControl(), selectionFunction, controls how the optimal model is selected from the performance estimates. The oneSE method chooses the simplest model within one standard error of the best performance to be the optimal model. Other methods are also available in the caret package. For detailed information, type ?best in the R console.
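To make the one-standard-error idea concrete, here is a minimal base-R sketch of the selection logic. The performance table res is hypothetical (the Kappa and standard error values are invented for illustration), one_se_pick() is not a caret function, and the sketch assumes that a larger k counts as the simpler KNN model.

```r
# Hypothetical resampling summary: one row per candidate k (values invented)
res <- data.frame(k     = c(1, 3, 5, 7, 9),
                  kappa = c(0.70, 0.66, 0.69, 0.63, 0.61),
                  se    = c(0.02, 0.02, 0.02, 0.02, 0.02))

# One-SE rule: among candidates whose performance is within one standard
# error of the best, keep the simplest (assumed here: the largest k)
one_se_pick <- function(res) {
  best <- which.max(res$kappa)
  threshold <- res$kappa[best] - res$se[best]
  eligible <- res[res$kappa >= threshold, ]
  eligible[which.max(eligible$k), ]
}

chosen <- one_se_pick(res)
chosen$k   # 5: k = 1 is best (0.70), but k = 5 (0.69) is within one SE of it
```

When no other candidate falls within one standard error of the best, the rule simply returns the best model itself.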

We can also specify a list of k values we want to test by creating a matrix or a grid.

ctrl<-trainControl(method="boot632", number=25, selectionFunction="oneSE")
grid<-expand.grid(.k=c(1, 3, 5, 7, 9))
# expand.grid() creates a data frame from all combinations of the supplied factors


Table 15.2 Six complementary methods for customizing the caret::trainControl() re-sampling

Resampling method                  Method name   Additional options and default values
Holdout sampling                   LGOCV         p = 0.75 (training data proportion)
k-fold cross-validation            cv            number = 10 (number of folds)
Repeated k-fold cross-validation   repeatedcv    number = 10 (number of folds), repeats = 10 (number of iterations)
Bootstrap sampling                 boot          number = 25 (resampling iterations)
0.632 bootstrap                    boot632       number = 25 (resampling iterations)
Leave-one-out cross-validation     LOOCV         None

Usually, to avoid ties, we prefer to choose an odd number of neighbors k. Now the constraints are all set. We can start to select models again using train().


set.seed(123)

m<-train(grade~., data=boystown_n, method="knn",
         metric="Kappa",
         trControl=ctrl,
         tuneGrid=grid)
m

## k-Nearest Neighbors
##
## 200 samples
##  10 predictor
##   2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
##   k  Accuracy   Kappa
##   1  0.8726660  0.7081751
##   3  0.8457584  0.6460742
##   5  0.8418226  0.6288675
##   7  0.8460327  0.6336463
##   9  0.8381961  0.6094088
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was k = 1.

Here we added metric="Kappa" so that the Kappa statistic, rather than accuracy, is the criterion used to select the optimal model. The reported accuracies of all candidate models are higher than those obtained with the default bootstrap sampling. In the output above, the selected model has k=1, a high accuracy (0.873), and a high Kappa statistic (0.708), which is much better than the model we had in Chap. 7. Note that the one-SE rule does not necessarily choose the model with the highest accuracy or Kappa statistic to be the "optimal model". Instead, it selects the simplest candidate whose performance is within one standard error of the best, which makes it more comprehensive than a rule that only looks at one statistic or a single quality measure. In this particular run, the two choices coincide.


15.2.2  Improving  Model  Performance  with  Meta-learning 

Meta-learning involves building multiple learners (using a single or multiple learning algorithms) at the same time. It combines the output from these learners and generates more effective meta-classifiers.

To decrease the variance (bagging) or bias (boosting), ensemble methods such as random forests attempt, in two steps, to correct the general tendency of decision trees to overfit the model to the training set:

1. Producing a distribution of simple ML models on subsets of the original data.

2. Combining the distribution into one "aggregated" model.

Before  stepping  into  the  details,  let’s  briefly  summarize: 

•  Bagging (stands for Bootstrap Aggregating) is a way to decrease the variance of your prediction by generating additional data for training from your original dataset. It generates multiple sets of the same cardinality/size as your original data, as combinations with repetitions. Resampling the training set in this way adds no new information, so it cannot improve the predictive force of the model; it only decreases the variance, narrowly tuning the prediction to the expected outcome.

•  Boosting is a two-step approach, where one first uses subsets of the original data to produce a series of moderately performing models and then "boosts" their performance by combining them together using a particular cost function (e.g., Accuracy). Unlike bagging, in classical boosting, the subset creation is not random and depends upon the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models. Usually, we prefer weaker classifiers in boosting. For example, a prevalent choice is to use a stump (level-one decision tree) in AdaBoost (Adaptive Boosting).
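The phrase "combinations with repetitions" can be illustrated with a few lines of base R. This is a small sketch on a hypothetical index set of n = 1000 observations (not one of the case-study datasets): a bootstrap sample has the same size as the original data, yet on average it contains only about 63% of the distinct observations, and the remaining ones are "out of bag".

```r
set.seed(123)
n <- 1000                                   # hypothetical number of observations

# One bootstrap sample: n row indices drawn with replacement
idx <- sample(n, size = n, replace = TRUE)

in_bag_fraction <- length(unique(idx)) / n  # typically close to 1 - exp(-1), about 0.632
oob <- setdiff(seq_len(n), idx)             # rows never drawn ("out of bag")

length(idx)        # same cardinality as the original data
in_bag_fraction
length(oob) / n    # the complementary out-of-bag fraction
```

The out-of-bag rows are what random forests later use to estimate the prediction error without a separate test set.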


15.2.3  Bagging 

One of the most well-known meta-learning methods is bootstrap aggregating, or bagging. It builds multiple models with bootstrap samples using a single algorithm. The models' predictions are combined with voting (for classification) or averaging (for numeric prediction). Voting means that the bagging model's prediction is based on the majority of learners' predictions for a class. Bagging is especially good with unstable learners like decision trees or SVM models.
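The two aggregation rules, voting and averaging, are sketched below in base R on small hypothetical prediction matrices (each column holds the predictions of one bagged learner, each row is one case). The names votes, majority_vote, and numeric_preds are illustrative only.

```r
# Hypothetical class predictions from three bagged learners for four cases
votes <- cbind(learner1 = c("a", "b", "a", "b"),
               learner2 = c("a", "a", "b", "b"),
               learner3 = c("b", "a", "a", "b"))

# Classification: the most frequent label across learners wins
# (ties are broken in favor of the label that sorts first)
majority_vote <- function(row) names(which.max(table(row)))
bagged_class <- unname(apply(votes, 1, majority_vote))
bagged_class   # "a" "a" "a" "b"

# Numeric prediction: average the learners' predictions for each case
numeric_preds <- cbind(learner1 = c(1.0, 2.0), learner2 = c(2.0, 2.5),
                       learner3 = c(3.0, 3.0))
bagged_value <- rowMeans(numeric_preds)
bagged_value   # 2.0 2.5
```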

To illustrate the Bagging method, we will again use the Quality of Life and chronic disease dataset in Chap. 9. Just like we did in the second practice problem in Chap. 11, we will use CHARLSONSCORE as the class labels, which has 11 different class labels.

qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol<-qol[!qol$CHARLSONSCORE==-9, -c(1, 2)]
qol$CHARLSONSCORE<-as.factor(qol$CHARLSONSCORE)

To apply bagging(), we need to install the ipred package first. After loading the package, we build a bagging model with CHARLSONSCORE as the class label and all other variables in the dataset as predictors. We can specify the number of voters (the decision tree models we want to have) via nbagg; the default number is 25.


# install.packages("ipred")
library(ipred)
set.seed(123)

mybag<-bagging(CHARLSONSCORE~., data=qol, nbagg=25)

Next, we shall use the predict() function to apply this model for prediction. For evaluation purposes, we create a table reporting the re-substitution error.

bt_pred<-predict(mybag, qol)
agreement<-bt_pred==qol$CHARLSONSCORE
prop.table(table(agreement))

## agreement
##       FALSE        TRUE
## 0.001718213 0.998281787

This model works very well with its training data. It labeled 99.8% of the cases correctly. To see its performance on future data, we apply the caret train() function again, with 10-fold CV repeated 10 times as the re-sampling method. In caret, the bagged trees method is called treebag.


library(caret)
set.seed(123)

ctrl<-trainControl(method="repeatedcv", number = 10, repeats = 10)
train(CHARLSONSCORE~., data=as.data.frame(qol), method="treebag", trControl=ctrl)

## Bagged CART
##
## 2328 samples
##   38 predictor
##   11 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 2095, 2096, 2093, 2094, 2098, 2097, ...
## Resampling results:
##
##   Accuracy   Kappa
##   0.5234615  0.2173193

We got an accuracy of 52% and a fair Kappa statistic. This result is better than our previous prediction attempt in Chap. 11 using the ksvm() function alone (~50%). Here, we combined the prediction results of many bagged decision trees, each built on the 38 predictors, to get this level of prediction accuracy.

In addition to decision tree classification, caret allows us to explore alternative bag() functions. For instance, instead of bagging based on decision trees, we can bag using an SVM model. caret provides a nice setting for SVM training, making predictions and counting votes in a list object svmBag. We can examine these objects by using the str() function.

str(svmBag)

## List of 3
##  $ fit      :function (x, y, ...)
##  $ pred     :function (object, x)
##  $ aggregate:function (x, type = "class")

Clearly,  fit  provides  the  training  functionality,  pred  the  prediction  and  fore¬ 
casting  on  new  data,  and  aggregate  is  a  way  to  combine  many  models  and 
achieve  voting-based  consensus.  Using  the  member  operator,  the  $  sign,  we  can 
explore  these  three  types  of  elements  of  the  svmBag  object.  For  instance,  the  fit 
element  may  be  extracted  from  the  SVM  object  by: 

svmBag$fit

## function (x, y, ...)
## {
##     loadNamespace("kernlab")
##     out <- kernlab::ksvm(as.matrix(x), y, prob.model = is.factor(y),
##         ...)
##     out
## }
## <environment: namespace:caret>

fit relies on the ksvm() function in the kernlab package, which means this package needs to be installed. The other two methods, pred and aggregate, may be explored in a similar way. They just follow the SVM model building and testing process we discussed in Chap. 11.

This svmBag object could be used as an optional setting in the train() function. However, this option requires that all features are linearly independent with trivial covariances, which may be rare in real world data.


15.2.4  Boosting 

Bagging  uses  equal  weights  for  all  learners  we  included  in  the  model.  Boosting  is 
quite  different  in  terms  of  weights.  Suppose  we  have  the  first  learner  correctly 
classifying  60%  of  the  observations.  This  60%  of  data  may  be  less  likely  to  be 
included  in  the  training  dataset  for  the  next  learner.  So,  we  have  more  learners 
working  on  “hard-to-classify”  observations. 
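One common way to make later learners focus on the "hard-to-classify" observations is to reweight the cases after each round. The base-R sketch below uses the classical AdaBoost.M1-style update as an assumed illustration (the text above does not commit to a specific update rule), with a hypothetical first learner that classifies 6 of 10 equally weighted cases correctly.

```r
# Hypothetical first learner: 6 of 10 cases classified correctly (60%)
correct <- c(rep(TRUE, 6), rep(FALSE, 4))
w <- rep(1 / length(correct), length(correct))   # equal starting weights

err <- sum(w[!correct])              # weighted error of the learner (0.4)
alpha <- log((1 - err) / err)        # the learner's weight in the final vote

# Up-weight the misclassified cases, then renormalize to sum to 1
w_new <- w * exp(alpha * (!correct))
w_new <- w_new / sum(w_new)

round(w_new, 4)        # correct cases: 0.0833 each; misclassified: 0.125 each
sum(w_new[!correct])   # misclassified cases now carry half of the total weight
```

After the update, the 40% of cases that were misclassified account for 50% of the weight, so the next learner is pushed toward them.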

Mathematically,  we  are  using  a  weighted  sum  of  functions  to  predict  the  outcome 
class  labels.  We  can  try  to  fit  the  true  model  using  weighted  additive  modeling.  We 
start  with  a  random  learner  that  can  classify  some  of  the  observations  correctly, 
possibly  with  some  errors. 


$$\hat{y}_1 = l_1.$$

This $l_1$ is our first learner and $\hat{y}_1$ denotes its predictions (this equation is in matrix form). Then, we can calculate the residuals of our first learner,

$$\epsilon_1 = y - v_1 \times \hat{y}_1,$$

where $v_k$ is a shrinkage parameter to avoid overfitting. Next, we fit the residual with another learner. This learner minimizes the following function

$$\sum_{i=1}^{N} \| y_i - l_{k-1} - l_k \|,$$

here k=2. Then we obtain a second model $l_2$ with:

$$\hat{y}_2 = l_2.$$

After that, we can update the residuals:

$$\epsilon_2 = \epsilon_1 - v_2 \times \hat{y}_2.$$

We repeat this residual fitting until adding another learner $l_k$ results in an updated residual $\epsilon_k$ that is smaller than a small predefined threshold. At the end, we will have an additive model like:

$$L = v_1 \times l_1 + v_2 \times l_2 + \dots + v_k \times l_k,$$

where we have k weak learners, but a very strong ensemble model.
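The residual-fitting loop can be sketched in base R. This is an illustrative toy under stated assumptions, not the implementation used by any boosting package: the weak learner is a regression stump chosen by least squares, the data are a noiseless sine curve, the shrinkage is fixed at v = 0.5 for every round, and the loop runs for a fixed number of rounds K rather than stopping at a residual threshold. The helper names fit_stump() and predict_stump() are hypothetical.

```r
# Weak learner: a regression stump (one split, one constant on each side)
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbors
  sse <- sapply(candidates, function(thr) {
    lo <- r[x <= thr]; hi <- r[x > thr]
    sum((lo - mean(lo))^2) + sum((hi - mean(hi))^2)
  })
  thr <- candidates[which.min(sse)]
  list(thr = thr, lo = mean(r[x <= thr]), hi = mean(r[x > thr]))
}
predict_stump <- function(s, x) ifelse(x <= s$thr, s$lo, s$hi)

# Toy data (hypothetical): a smooth target that no single stump can fit
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x)

v <- 0.5                   # shrinkage parameter, the same for every learner
K <- 50                    # number of weak learners
resid <- y                 # residuals before any learner is added
L <- rep(0, length(y))     # additive model: L = v*l_1 + v*l_2 + ... + v*l_K
mse <- numeric(K)

for (k in seq_len(K)) {
  l_k <- fit_stump(x, resid)          # fit the next learner to the residuals
  y_hat_k <- predict_stump(l_k, x)
  L <- L + v * y_hat_k                # add the shrunken learner to the ensemble
  resid <- resid - v * y_hat_k        # epsilon_k = epsilon_{k-1} - v * y_hat_k
  mse[k] <- mean(resid^2)
}

c(before = mean(y^2), after_1 = mse[1], after_K = mse[K])
```

Each stump alone is a poor fit to the sine curve, yet the training error of the additive model shrinks with every round, which is the behavior the final equation above describes.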

Schapire and Freund found that although individual learners trained on the pilot observations might be very weak in predicting in isolation, boosting the collective power of all of them is expected to generate a model no worse than the best of all individual constituent models included in the boosting ensemble. Usually, the boosting results are quite a bit better than the best single model.

Boosting  can  be  used  for  almost  all  models.  Most  commonly,  it  is  applied  to 
decision  trees. 


15.2.5  Random  Forests 

Random forests, or decision tree forests, represent an ensemble method, built on bagging with random feature selection, that focuses on decision tree learners.


Training  Random  Forests 

One approach to train and build random forests relies on using randomForest() under the randomForest package. It has the following components:

m<-randomForest(expression, data, ntree=500, mtry=sqrt(p))

•  expression: the class variable and features we want to include in the model.

•  data: training data containing class and features.

•  ntree: number of voting decision trees.

•  mtry: optional integer specifying the number of features to randomly select at each split. The p stands for number of features in the data.

Let's build a random forest using the Quality of Life dataset.


# install.packages("randomForest")
library(randomForest)
set.seed(123)

rf<-randomForest(CHARLSONSCORE~., data=qol)
rf

##
## Call:
##  randomForest(formula = CHARLSONSCORE ~ ., data = qol)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
##
##         OOB estimate of  error rate: 46.13%
## Confusion matrix:
##      0   1 2 3 4 5 6 7 8 9 10 class.error
## 0  574 301 2 0 0 0 0 0 0 0  0   0.3454960
## 1  305 678 1 0 0 0 0 0 0 0  0   0.3109756
## 2   90 185 2 0 0 0 0 0 0 0  0   0.9927798
## 3   25 101 1 0 0 0 0 0 0 0  0   1.0000000
## 4    5  19 0 0 0 0 0 0 0 0  0   1.0000000
## 5    3   4 0 0 0 0 0 0 0 0  0   1.0000000
## 6    1   4 0 0 0 0 0 0 0 0  0   1.0000000
## 7    1   1 0 0 0 0 0 0 0 0  0   1.0000000
## 8    7   8 0 0 0 0 0 0 0 0  0   1.0000000
## 9    3   5 0 0 0 0 0 0 0 0  0   1.0000000
## 10   1   1 0 0 0 0 0 0 0 0  0   1.0000000

By default the model contains 500 decision trees and tries 6 variables at each split. Its OOB, or out-of-bag, error rate is about 46%, which corresponds to a poor accuracy rate (54%). Note that the OOB error rate is not the re-substitution error. The confusion matrix below it reports the OOB error rate for each specific class. All of these error rates are reasonable estimates of future performance with unseen data. We can see that this model is so far the best of all models, although it is still not good at predicting high CHARLSONSCORE.
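The summary numbers in the random forest printout can be recomputed from its confusion matrix. The base-R sketch below re-enters the reported OOB counts by hand (only the first three columns are non-zero) and derives the overall OOB error, the per-class errors, and the default number of variables tried at each split for a classification forest with 38 features; the object name conf_mat is an illustrative choice.

```r
classes <- as.character(0:10)

# OOB confusion matrix transcribed from the printed output:
# rows = observed class, columns = predicted class
conf_mat <- matrix(0, nrow = 11, ncol = 11, dimnames = list(classes, classes))
conf_mat[, "0"] <- c(574, 305,  90,  25,  5, 3, 1, 1, 7, 3, 1)
conf_mat[, "1"] <- c(301, 678, 185, 101, 19, 4, 4, 1, 8, 5, 1)
conf_mat[, "2"] <- c(  2,   1,   2,   1,  0, 0, 0, 0, 0, 0, 0)

# Overall OOB error: share of cases off the diagonal
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Per-class error: share of each observed class that was mislabeled
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

round(100 * oob_error, 2)      # 46.13
round(class_error[1:3], 4)     # classes 0, 1, 2

# Default mtry for classification: floor of the square root of the feature count
floor(sqrt(38))                # 6
```

The per-class errors make the weakness explicit: every case with a score of 3 or higher is assigned to one of the three most frequent classes.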


Evaluating  Random  Forest  Performance 

The caret package also supports random forest model building and evaluation. It reports more detailed model performance evaluations. As usual, we need to specify a re-sampling method and a parameter grid. As an example, we use the 10-fold CV re-sampling method. The grid for this model contains information about the mtry parameter (the only tuning parameter for random forest). Previously we tried the default value $\lfloor\sqrt{38}\rfloor = 6$ (38 is the number of features). This time we could compare multiple mtry parameters.


library(caret)

ctrl<-trainControl(method="cv", number=10)
grid_rf<-expand.grid(.mtry=c(2, 4, 8, 16))

Next, we apply the train() function with our ctrl and grid_rf settings.

set.seed(123)

m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
              metric = "Kappa", trControl = ctrl,
              tuneGrid = grid_rf)
m_rf

## Random Forest
##
## 2328 samples
##   38 predictor
##   11 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2095, 2096, 2093, 2094, 2098, 2097, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.5223871  0.1979731
##    4    0.5403799  0.2309963
##    8    0.5382674  0.2287595
##   16    0.5421562  0.2367477
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.

This call may take a while to complete. The result appears to be a good model: with mtry=16, we reached a relatively high accuracy and a good Kappa statistic. This is a very good result for a learner with 11 classes.


15.2.6  Adaptive  Boosting 

We may achieve even higher accuracy using AdaBoost. Adaptive boosting (AdaBoost) can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by the previous classifiers.

For binary classification, we could use ada() in package ada, and for multiple classes (multinomial/polytomous outcomes) we can use the package adabag. The boosting() function allows us to select the type of method by setting coeflearn. Two prevalent types of adaptive boosting methods can be used. One is the AdaBoost.M1 algorithm, which includes the Breiman and Freund variants, and the other is Zhu's SAMME algorithm. Let's see some examples:

set.seed(123)

qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol<-qol[!qol$CHARLSONSCORE==-9, -c(1, 2)]
qol$CHARLSONSCORE<-as.factor(qol$CHARLSONSCORE)

The key parameter in the adabag::boosting() method is coeflearn, which sets the weight updating coefficient $\alpha$ as a function of the error $err$ of the current tree:

•  Breiman (default), corresponding to $\alpha = \frac{1}{2}\ln\left(\frac{1-err}{err}\right)$, using the AdaBoost.M1 algorithm,

•  Freund, corresponding to $\alpha = \ln\left(\frac{1-err}{err}\right)$, or

•  Zhu, corresponding to $\alpha = \ln\left(\frac{1-err}{err}\right) + \ln(nclasses - 1)$.

The generalizations of AdaBoost for multiple classes (>2) include AdaBoost.M1 (where individual trees are required to have an error $< \frac{1}{2}$) and SAMME (where individual trees are required to have an error $< 1 - \frac{1}{nclasses}$).
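The three coefficient formulas and the two error bounds can be compared numerically. The helper boost_alpha() below is a hypothetical base-R function written for this illustration (it is not part of adabag); it evaluates each formula for a weak tree whose error of 0.6 exceeds 1/2 in an 11-class problem.

```r
# Hypothetical helper: weight updating coefficient for a tree with error `err`
boost_alpha <- function(err, nclasses, coeflearn = c("Breiman", "Freund", "Zhu")) {
  coeflearn <- match.arg(coeflearn)
  base <- log((1 - err) / err)
  switch(coeflearn,
         Breiman = 0.5 * base,
         Freund  = base,
         Zhu     = base + log(nclasses - 1))
}

# A tree that is wrong 60% of the time on 11 classes: worse than 1/2,
# but still better than random guessing (error 1 - 1/11, about 0.91)
a_weak <- sapply(c("Breiman", "Freund", "Zhu"), boost_alpha,
                 err = 0.6, nclasses = 11)
round(a_weak, 3)

# The SAMME (Zhu) coefficient reaches zero only at err = 1 - 1/nclasses
boost_alpha(err = 1 - 1 / 11, nclasses = 11, coeflearn = "Zhu")
```

Under the two AdaBoost.M1 formulas this tree receives a negative coefficient, while the Zhu formula still gives it a positive weight, which is why SAMME can tolerate weaker trees when there are many classes.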

# install.packages("ada"); install.packages("adabag")
library("ada"); library("adabag")

qol_boost <- boosting(CHARLSONSCORE~., data=qol, mfinal = 100, coeflearn = 'Breiman')
mean(qol_boost$class==qol$CHARLSONSCORE)

## [1] 0.5425258

qol_boost <- boosting(CHARLSONSCORE~., data=qol, mfinal = 100, coeflearn = 'Freund')
mean(qol_boost$class==qol$CHARLSONSCORE)

## [1] 0.5524055

qol_boost <- boosting(CHARLSONSCORE~., data=qol, mfinal = 100, coeflearn = 'Zhu')
mean(qol_boost$class==qol$CHARLSONSCORE)

## [1] 0.6542096

We observe that in this case, the Zhu approach achieves the best results. Notice that the default method is AdaBoost.M1 with the Breiman coefficient, and mfinal is the number of iterations for which boosting is run, i.e., the number of trees to use.

Try applying model improvement techniques using other data from the list of our Case-Studies (Fig. 15.1).


15.3  Assignment:  15.  Improving  Model  Performance 

Use some of the methods below to do classification, prediction, and model performance evaluation (Table 15.3).


[Figure: live demo of the Iris flowers classification using the adabag package (applies multiclass AdaBoost.M1, SAMME, and bagging algorithms), https://rdrr.io/cran/adabag/man/adabag-package.html]

Fig. 15.1 Live demo: Iris flowers classification using adabag


Table 15.3 Performance evaluation for several classification, prediction, and clustering methods

Model                                           Learning task    Method      Parameters
KNN                                             Classification   knn         k
Naive Bayes                                     Classification   nb          fL, usekernel
Decision Trees                                  Classification   C5.0        model, trials, winnow
OneR Rule Learner                               Classification   OneR        None
RIPPER Rule Learner                             Classification   JRip        NumOpt
Linear Regression                               Regression       lm          None
Regression Trees                                Regression       rpart       cp
Model Trees                                     Regression       M5          pruned, smoothed, rules
Neural Networks                                 Dual use         nnet        size, decay
Support Vector Machines (Linear Kernel)         Dual use         svmLinear   C
Support Vector Machines (Radial Basis Kernel)   Dual use         svmRadial   C, sigma
Random Forests                                  Dual use         rf          mtry

15.3.1 Model Improvement Case Study

From the course datasets, use the 05_PPMI_top_UPDRS_Integrated_LongFormat1.csv case-study data to perform a multi-class prediction. Use ResearchGroup as the response, which has three classes: "PD", "Control", and "SWEDD".

•  Delete the ID column, impute missing values with the mean or median, and justify your choice.

•  Normalize the covariates.

•  Implement an automated parameter tuning process and report the optimal accuracy and k.

•  Set arguments and rerun the tuning, trying different method and number settings.

•  Train a random forest, tune the parameters, report the result, and output a cross table.

•  Use a bagging algorithm and report the accuracy and k.

•  Perform randomForest and report the accuracy and k.

•  Report the accuracy by AdaBoost and make sure to try all three methods.

•  Finally, give a brief summary of all the model improvement approaches.

•  Try the procedure on other data in the list of Case-Studies, e.g., Traumatic Brain Injury Study and the corresponding dataset.
Chapter 16

Specialized  Machine  Learning  Topics 



This chapter presents some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 22 provides additional optimization details. We show format conversion and working with XML, SQL, JSON, CSV, SAS and other data objects. In addition, we illustrate SQL server queries, describe protocols for managing, classifying and predicting outcomes from data streams, and demonstrate strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing.

The Internet of Things (IoT) leads to a paradigm shift of scientific inference, from static data interrogated in a batch or distributed environment to on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized data, data-streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered in real time in hospital ICUs (e.g., microsecond Electroencephalography signals, EEGs), dynamically changing stock market data (e.g., Dow Jones Industrial Average Index, DJI), and weather patterns.

We  will  present  (1)  format  conversion  of  XML,  SQL,  JSON,  CSV,  SAS  and  other 
data  objects,  (2)  visualization  of  bioinformatics  and  network  data,  (3)  protocols  for 
managing,  classifying  and  predicting  outcomes  from  data  streams,  (4)  strategies  for 
optimization,  improvement  of  computational  performance,  parallel  (MPI)  and 
graphics  (GPU)  computing,  and  (5)  processing  of  very  large  datasets. 


16.1  Working  with  Specialized  Data  and  Databases 

Unlike the case studies we saw in the previous chapters, real-world data may
not always be nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle,
and harmonize scattered information to generate computable data objects that can be
further processed by various techniques. Data wrangling and preprocessing may take
over 80% of the time researchers spend interrogating complex multi-source data
archives. The following procedures will enhance your skills in collecting and
handling heterogeneous real-world data. Multiple examples of handling long-and-wide
data, messy and tidy data, and data cleaning strategies can be found in the JSS Tidy
Data article by Hadley Wickham.


16.1.1  Data  Format  Conversion 

The R package rio imports and exports various types of file formats, e.g.,
tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta),
SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and
SAS (.sas7bdat and .xpt).

rio provides three important functions: import(), export(), and convert().
They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files
as an example. First, we can download 02_Nofl_Data.dta from our datasets folder.


# install.packages("rio")
library(rio)

# Download the Stata .dta file locally first
# Local data can be loaded by:
# nofl <- import("02_Nofl_Data.dta")

# The data can also be loaded remotely from the server:
nofl <- read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
str(nofl)


## 'data.frame':    900 obs. of  10 variables:
##  $ ID       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Tx       : int  1 1 0 0 1 1 0 0 1 1 ...
##  $ SelfEff  : int  33 33 33 33 33 33 33 33 33 33 ...
##  $ SelfEff25: int  8 8 8 8 8 8 8 8 8 8 ...
##  $ WPSS     : num  0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
##  $ SocSuppt : num  5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
##  $ PMss     : num  4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
##  $ PMss3    : num  1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
##  $ PhyAct   : int  53 73 23 36 21 0 21 0 73 114 ...


The data are automatically stored as a data frame. Note that rio sets
stringsAsFactors=FALSE by default.

rio can also export data into any other supported format we choose. To do this, we
use the export() function.


# Sys.getenv("R_ZIPCMD", "zip")   # Get the path of the zip application
Sys.setenv(R_ZIPCMD="E:/Tools/ZIP/bin/zip.exe")
Sys.getenv("R_ZIPCMD", "zip")

## [1] "E:/Tools/ZIP/bin/zip.exe"

export(nofl, "02_Nofl.xlsx")

This line of code exports the Nofl data in xlsx format to the R working
directory. Mac users may have a problem exporting *.xlsx files using rio
because of a lack of a zip tool, but can still output other formats such as ".csv".
An alternative strategy to save an xlsx file is to use the package xlsx with the default
row.names=TRUE.

rio also provides a one-step process to convert and save data into alternative
formats. The following simple code converts a stored file into a CSV file. The
commented line converts the 02_Nofl_Data.dta file we just downloaded, and the active
line converts the 02_Nofl.xlsx file we just exported.

# convert("02_Nofl_Data.dta", "02_Nofl_Data.csv")
convert("02_Nofl.xlsx", "02_Nofl_Data.csv")

You can see a new CSV file appear in the current working directory. Similar
transformations are available for other data formats and types.
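The import/export/convert workflow described above can be tried without downloading anything. The following is a minimal sketch, assuming the rio package is installed; the toy data frame and the temporary file names are hypothetical and only stand in for a real dataset.

```r
library(rio)

# Hypothetical toy data frame standing in for a real dataset
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Work with temporary files so nothing is left in the working directory
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")

export(toy, csv_file)          # write CSV; the format is inferred from the extension
convert(csv_file, tsv_file)    # one-step CSV -> TSV conversion
toy_back <- import(tsv_file)   # read the converted file back into a data frame

str(toy_back)
```

In all three calls the file extension determines the format, so switching the target to another supported format should only require changing the extension.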


16.1.2  Querying  Data  in  SQL  Databases 

Let’s  use  as  an  example  the  CDC  Behavioral  Risk  Factor  Surveillance  System 
(BRFSS)  Data,  2013-2015.  This  file  for  the  combined  landline  and  cell  phone  data 
set  was  exported  from  SAS  V9.3  in  the  XPT  transport  format.  This  file  contains 
330  variables  and  can  be  imported  into  SPSS  or  STATA.  Please  note:  some  of  the 
variable  labels  get  truncated  in  the  process  of  converting  to  the  XPT  format. 

Be  careful  -  this  compressed  (ZIP)  file  is  over  315MB  in  size! 

# install.packages("Hmisc")
library(Hmisc)

memory.size(max=T)   # available on Windows only
## [1] 115.81

pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip", pathToZip)

# let's just pull two of the 3 years of data (2013 and 2015)
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])
## Processing SAS dataset LLCP2013

brfss_2015 <- sasxport.get(unzip(pathToZip)[3])
## Processing SAS dataset LLCP2015

dim(brfss_2013); object.size(brfss_2013)
## [1] 491773    336
## 685581232 bytes

# summary(brfss_2013[1:1000, 1:10])  # subsample the data

# report the summaries for has_plan (not defined yet, hence the NULL summary)
summary(brfss_2013$has_plan)
## Length  Class   Mode
##      0   NULL   NULL

brfss_2013$x.race <- as.factor(brfss_2013$x.race)
summary(brfss_2013$x.race)
##      1      2      3      4      5      6      7      8      9   NA's
## 376451  39151   7683   9510   1546   2693   9130  37054   8530     25

# clean up
unlink(pathToZip)

Let’s  try  to  use  logistic  regression  to  find  out  if  self-reported  race/ethnicity 
predicts  the  binary  outcome  of  having  a  health  care  plan. 

brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1
system.time(
  gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
              family=binomial)
)   # report execution time

##    user  system elapsed
##    2.20    0.23    2.46

summary(gml1)

##
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial,
##     data = brfss_2013)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.1862   0.4385   0.4385   0.4385   0.8047
##
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)
## (Intercept)         2.293549   0.005649  406.044   <2e-16 ***
## as.factor(x.race)2 -0.721676   0.014536  -49.647   <2e-16 ***
## as.factor(x.race)3 -0.511776   0.032974  -15.520   <2e-16 ***
## as.factor(x.race)4 -0.329489   0.031726  -10.386   <2e-16 ***
## as.factor(x.race)5 -1.119329   0.060153  -18.608   <2e-16 ***
## as.factor(x.race)6 -0.544458   0.054535   -9.984   <2e-16 ***
## as.factor(x.race)7 -0.510452   0.030346  -16.821   <2e-16 ***
## as.factor(x.race)8 -1.332005   0.012915 -103.138   <2e-16 ***
## as.factor(x.race)9 -0.582204   0.030604  -19.024   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 353371  on 491747  degrees of freedom
## Residual deviance: 342497  on 491739  degrees of freedom
##   (25 observations deleted due to missingness)
## AIC: 342515
##
## Number of Fisher Scoring iterations: 5

Next, we’ll examine the odds (or rather the log odds ratio, LOR) of having a health
care plan (HCP) by race (R). The LORs are calculated for two array dimensions,
separately for each race level (the presence of a health care plan (HCP) is binary, whereas
race (R) has 9 levels, R1, R2, ..., R9). For example, the odds ratio of having an HCP
for R1 : R2 is:

$$OR(R1 : R2) = \frac{\dfrac{P(HCP \mid R1)}{1 - P(HCP \mid R1)}}{\dfrac{P(HCP \mid R2)}{1 - P(HCP \mid R2)}}.$$

# load the vcd package to compute the LOR
library("vcd")

## Loading required package: grid

lor_HCP_by_R <- loddsratio(has_plan ~ as.factor(x.race), data = brfss_2013)
lor_HCP_by_R

## log odds ratios for has_plan and as.factor(x.race)
##
##         1:2         2:3         3:4         4:5         5:6         6:7
## -0.72167619  0.20990061  0.18228646 -0.78984000  0.57487142  0.03400611
##         7:8         8:9
## -0.82155382  0.74980101
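To make the link between the odds-ratio formula and the logistic regression coefficients concrete, here is a minimal base-R sketch. The 2 x 2 counts are hypothetical (they are not BRFSS values), and the group labels R1 and R2 simply mirror the notation above.

```r
# Hypothetical counts: rows are groups (R1, R2), columns are plan status
tab <- matrix(c(90, 10,
                80, 20), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("R1", "R2"), plan = c("yes", "no")))

# Odds of having a plan within each group: P / (1 - P) = yes / no
odds <- tab[, "yes"] / tab[, "no"]

# Odds ratio and log odds ratio for R1 : R2
or_R1_R2  <- unname(odds["R1"] / odds["R2"])
lor_R1_R2 <- log(or_R1_R2)

# The same quantity from a logistic regression on the grouped counts;
# the coefficient for R2 is log(odds(R2) / odds(R1)) = -lor_R1_R2
counts <- data.frame(group = factor(c("R1", "R2")),
                     yes = tab[, "yes"], no = tab[, "no"])
fit <- glm(cbind(yes, no) ~ group, data = counts, family = binomial)
beta_R2 <- unname(coef(fit)["groupR2"])

c(OR = or_R1_R2, LOR = lor_R1_R2, glm_coefficient = beta_R2)
```

With R1 as the reference level, the fitted coefficient for R2 is the log of odds(R2) / odds(R1), i.e., the negative of the R1 : R2 log odds ratio as defined in the formula.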

Now, let’s see an example of querying a database containing a structured relational
collection of data records. A query is a machine instruction (typically represented as
text) sent by a user to a remote database requesting a specific database operation
(e.g., search or summary). One database communication protocol relies on SQL
(Structured Query Language). MySQL is an instance of a database management
system that supports SQL communication and is utilized by many web applications,
e.g., YouTube, Flickr, Wikipedia, and biological databases like GO and Ensembl. Below
is an example of an SQL query using the package RMySQL. An alternative way to
interface with an SQL database is to use the package RODBC.


# install.packages("DBI"); install.packages("RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI)
library(RMySQL)
# the pipeline at the end also needs dplyr, stringr, and readr
library(dplyr); library(stringr); library(readr)

ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg38',
                            host='genome-mysql.cse.ucsc.edu')
result <- dbGetQuery(ucscGenomeConn, "show databases;")

# List the DB tables
allTables <- dbListTables(ucscGenomeConn); length(allTables)

# Get dimensions of a table, read and report the head
dbListFields(ucscGenomeConn, "affyU133Plus2")
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2"); head(affyData)

# Select a subset, fetch the data, and report the quantiles
subsetQuery <- dbSendQuery(ucscGenomeConn,
    "select * from affyU133Plus2 where misMatches between 1 and 3")
affySmall <- fetch(subsetQuery); quantile(affySmall$misMatches)

# Get repeat mask
bedFile <- 'repUCSC.bed'
df <- dbSendQuery(ucscGenomeConn, 'select genoName, genoStart, genoEnd,
    repName, swScore, strand, repClass, repFamily from rmsk') %>%
  dbFetch(n=-1) %>%
  mutate(genoName = str_replace(genoName, 'chr', '')) %>%
  as_tibble() %>%
  write_tsv(bedFile, col_names=F)
message('written ', bedFile)

# Once done, close the connection
dbDisconnect(ucscGenomeConn)

Completing the above database SQL commands requires access to the remote
UCSC SQL Genome server and user-specific credentials. You can see this functional
example on the DSPA website. Below is another example that can be done by all
readers, as it relies only on local services.

# install.packages("RSQLite")
library("RSQLite")

# generate an empty DB and store it in RAM
myConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
myConnection

## <SQLiteConnection>
##   Path: :memory:
##   Extensions: TRUE

dbListTables(myConnection)

## character(0)

# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)

## [1] TRUE

dbWriteTable(myConnection, "brfss_2013", brfss_2013)

## [1] TRUE

dbWriteTable(myConnection, "brfss_2015", brfss_2015)

## [1] TRUE

# Check again the DB content
dbListFields(myConnection, "brfss_2013")

##   [1] "x.state"  "fmonth"   "idate"    "imonth"   "iday"
##   [6] "iyear"    "dispcode" "seqno"    "x.psu"    "ctelenum"
##  [11] "pvtresd1" "colghous" "stateres" "cellfon3" "ladult"
##  [16] "numadult" "nummen"   "numwomen" "genhlth"  "physhlth"
##  [21] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"
## ...
## [331] "rcsbrac1" "rcsrace1" "rchisla1" "rcsbirth" "typeinds"
## [336] "typework" "has_plan"

dbListTables(myConnection);

## [1] "USArrests"  "brfss_2013" "brfss_2015"

# Retrieve the entire DB table (for the smaller USArrests table)
dbGetQuery(myConnection, "SELECT * FROM USArrests")

##    Murder Assault UrbanPop Rape
## 1    13.2     236       58 21.2
## 2    10.0     263       48 44.5
## 3     8.1     294       80 31.0
## 4     8.8     190       50 19.5
## 5     9.0     276       91 40.6
## 6     7.9     204       78 38.7
## 7     3.3     110       77 11.1
## 8     5.9     238       72 15.8
## 9    15.4     335       80 31.9
## 10   17.4     211       60 25.8
## 11    5.3      46       83 20.2
## 12    2.6     120       54 14.2
## 13   10.4     249       83 24.0
## 14    7.2     113       65 21.0
## 15    2.2      56       57 11.3
## 16    6.0     115       66 18.0
## 17    9.7     109       52 16.3
## 18   15.4     249       66 22.2
## 19    2.1      83       51  7.8
## 20   11.3     300       67 27.8
## 21    4.4     149       85 16.3
## 22   12.1     255       74 35.1
## 23    2.7      72       66 14.9
## 24   16.1     259       44 17.1
## 25    9.0     178       70 28.2
## 26    6.0     109       53 16.4
## 27    4.3     102       62 16.5
## 28   12.2     252       81 46.0
## 29    2.1      57       56  9.5
## 30    7.4     159       89 18.8
## 31   11.4     285       70 32.1
## 32   11.1     254       86 26.1
## 33   13.0     337       45 16.1
## 34    0.8      45       44  7.3
## 35    7.3     120       75 21.4
## 36    6.6     151       68 20.0
## 37    4.9     159       67 29.3
## 38    6.3     106       72 14.9
## 39    3.4     174       87  8.3
## 40   14.4     279       48 22.5
## 41    3.8      86       45 12.8
## 42   13.2     188       59 26.9
## 43   12.7     201       80 25.5
## 44    3.2     120       80 22.9
## 45    2.2      48       32 11.2
## 46    8.5     156       63 20.7
## 47    4.0     145       73 26.2
## 48    5.7      81       39  9.3
## 49    2.6      53       66 10.8
## 50    6.8     161       60 15.6

# Retrieve just the average of one feature
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests"); myQuery

##   avg(Assault)
## 1       170.76

myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery

##    avg(Assault)
## 1         48.00
## 2         81.00
## 3        152.00
## 4        211.50
## 5        271.00
## 6        190.00
## 7         83.00
## 8        109.00
## 9        109.00
## 10       120.00
## 11        57.00
## 12        56.00
## 13       236.00
## 14       188.00
## 15       186.00
## 16       102.00
## 17       156.00
## 18       113.00
## 19       122.25
## 20       229.50
## 21       151.00
## 22       231.50
## 23       172.00
## 24       145.00
## 25       255.00
## 26       120.00
## 27       110.00
## 28       204.00
## 29       237.50
## 30       252.00
## 31       147.50
## 32       149.00
## 33       254.00
## 34       174.00
## 35       159.00
## 36       276.00

# Or do it in batches (for the much larger brfss_2013 and brfss_2015 tables)
myQuery <- dbGetQuery(myConnection, "SELECT * FROM brfss_2013")

# extract data in chunks of 2 rows, note: dbGetQuery vs. dbSendQuery
# myQuery <- dbSendQuery(myConnection, "SELECT * FROM brfss_2013")
# fetch2 <- dbFetch(myQuery, n = 2); fetch2

# do we have other cases in the DB remaining?
# extract all remaining data
# fetchRemaining <- dbFetch(myQuery, n = -1); fetchRemaining

# We should have all data in DB now
# dbHasCompleted(myQuery)

# compute the average (poorhlth) grouping by Insurance (hlthpln1)
# Try some alternatives: numadult nummen numwomen genhlth physhlth menthlth
# poorhlth hlthpln1
myQuery1_13 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2013 GROUP BY hlthpln1"); myQuery1_13

##   avg(poorhlth)
## 1      56.25466
## 2      53.99962
## 3      58.85072
## 4      66.26757

# Compare 2013 vs. 2015: Health grouping by Insurance
myQuery1_15 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2015 GROUP BY hlthpln1"); myQuery1_15

##   avg(poorhlth)
## 1      55.75539
## 2      55.49487
## 3      61.35445
## 4      67.62125

myQuery1_13 - myQuery1_15

##   avg(poorhlth)
## 1     0.4992652
## 2    -1.4952515
## 3    -2.5037326
## 4    -1.3536797

# reset the DB query
# dbClearResult(myQuery)

# clean up
dbDisconnect(myConnection)

## [1] TRUE
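The grouped averages above are hard to interpret because the grouping key is not part of the result. The following is a minimal sketch, assuming the DBI and RSQLite packages are installed, that returns the key next to each aggregate and also illustrates batched retrieval with dbSendQuery(), dbFetch(), dbHasCompleted(), and dbClearResult(). The connection name con, the column aliases, and the batch size of 20 are illustrative choices, not part of the original example.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # throwaway in-memory database
dbWriteTable(con, "USArrests", USArrests)

# Return the grouping key next to each aggregate so every row can be interpreted
byUrban <- dbGetQuery(con,
  "SELECT UrbanPop, avg(Assault) AS avgAssault, count(*) AS n
     FROM USArrests GROUP BY UrbanPop ORDER BY UrbanPop")
head(byUrban)

# Batched retrieval: send the query once, then fetch a few rows at a time
res <- dbSendQuery(con, "SELECT * FROM USArrests")
nRows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 20)    # at most 20 rows per batch
  nRows <- nRows + nrow(chunk)     # process each chunk here
}
dbClearResult(res)                 # release the result set
dbDisconnect(con)
nRows
```

dbGetQuery() sends the query, fetches everything, and clears the result in one call, whereas dbSendQuery() leaves the result set open so that large tables can be processed chunk by chunk.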


16.1.3  Real  Random  Number  Generation 

We are already familiar with (pseudo) random number generation (e.g.,
rnorm(100, 10, 4) or runif(100, 10, 20)), which algorithmically generates
computer values subject to specified distributions. There are also web services,
e.g., random.org, that can provide true random numbers based on atmospheric
noise, rather than using a pseudo random number generation protocol. Below is one
example of generating a total of 300 numbers arranged in 3 columns, each of
100 rows of random integers (in decimal format) between 100 and 200.


# https://www.random.org/integers/?num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new
siteURL <- "http://random.org/integers/"   # base URL
shortQuery <- "num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new"
# concatenate the base URL and the query string, separated by "?"
completeQuery <- paste(siteURL, shortQuery, sep="?")
rngNumbers <- read.table(file=completeQuery)   # submit the query and read the data
rngNumbers

##      V1  V2  V3
## 1   144 179 131
## 2   127 160 150
## 3   142 169 109
## ...
## 98  178 103 134
## 99  173 178 156
## 100 117 118 110
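For contrast with the web-service numbers, the following small base-R sketch shows that pseudo-random streams are fully determined by their seed, and how the query string could be assembled from named parameters. The seed value is arbitrary, and the parameter vector simply repeats the values used in the query above.

```r
# Pseudo-random numbers are reproducible: the same seed gives the same stream
set.seed(1234)                        # arbitrary seed, chosen for illustration
x1 <- rnorm(100, mean = 10, sd = 4)
set.seed(1234)
x2 <- rnorm(100, mean = 10, sd = 4)
identical(x1, x2)

# Assemble the query string from named parameters (same values as above)
params <- c(num = 300, min = 100, max = 200, col = 3, base = 10)
query  <- paste0(names(params), "=", params, collapse = "&")
paste0("https://www.random.org/integers/?", query, "&format=plain&rnd=new")
```

Numbers retrieved from an atmospheric-noise service cannot be regenerated this way, so they should be saved if the analysis needs to be repeated exactly.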


16.1.4  Downloading  the  Complete  Text  of  Web  Pages 

The RCurl package provides a powerful tool for extracting and scraping information
from websites. Let’s install it and extract information from a SOCR website.


# install.packages("RCurl")
library(RCurl)

## Loading required package: bitops

web <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
str(web, nchar.max = 200)

##  chr "<!DOCTYPE html>\n<html lang=\"en\" dir=\"ltr\" class=\"client-nojs\">\n<head>\n<meta charset=\"UTF-8\" />\n<title>SOCR Data - SOCR</title>\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=EDGE\" />"| __truncated__

The web object looks incomprehensible. This is because most websites are
wrapped in XML/HTML hypertext or include JSON formatted metadata. RCurl
deals with special HTML tags and website metadata.

To deal with the web pages only, the httr package would be a better choice than
RCurl. It returns a list that makes much more sense.

# install.packages("httr")
library(httr)
web <- GET("http://wiki.socr.umich.edu/index.php/SOCR_Data")
str(web[1:3])

## List of 3
##  $ url        : chr "http://wiki.socr.umich.edu/index.php/SOCR_Data"
##  $ status_code: int 200
##  $ headers    :List of 12
##   ..$ date                  : chr "Mon, 03 Jul 2017 19:09:56 GMT"
##   ..$ server                : chr "Apache/2.2.15 (Red Hat)"
##   ..$ x-powered-by          : chr "PHP/5.3.3"
##   ..$ x-content-type-options: chr "nosniff"
##   ..$ content-language      : chr "en"
##   ..$ vary                  : chr "Accept-Encoding, Cookie"
##   ..$ expires               : chr "Thu, 01 Jan 1970 00:00:00 GMT"
##   ..$ cache-control         : chr "private, must-revalidate, max-age=0"
##   ..$ last-modified         : chr "Sat, 22 Oct 2016 21:46:21 GMT"
##   ..$ connection            : chr "close"
##   ..$ transfer-encoding     : chr "chunked"
##   ..$ content-type          : chr "text/html; charset=UTF-8"
##   ..- attr(*, "class")= chr [1:2] "insensitive" "list"


16.1.5  Reading  and  Writing  XML  with  the  XML  Package 

A combination of the RCurl and the XML packages can help us extract only the
plain text in our desired webpages. This is very helpful for getting information
from text-heavy websites.

web <- getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)

# install.packages("XML")
library(XML)
web.parsed <- htmlParse(web, asText = T)
plain.text <- xpathSApply(web.parsed, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))

## The links below contain a number of datasets that may be used for demonstration purposes in probability and statistics education. There are two types of data - simulated (computer-generated using random sampling) and observed (research, observationally or experimentally acquired).
##
## The SOCR resources provide a number of mechanisms to simulate data using computer random-number generators. Here are some of the most commonly used SOCR generators of simulated data:
##
## The following collections include a number of real observed datasets from different disciplines, acquired using different techniques and applicable in different situations.
##
## In addition to human interactions with the SOCR Data, we provide several machine interfaces to consume and process these data.
##
## Translate this page:
##
## (default)
##
## Deutsch
## Romania
##
## Sverige

Here we extracted all plain text between the starting and ending paragraph
HTML tags, <p> and </p>.

More information about extracting text from XML/HTML via XPath is
available online.
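The same XPath extraction can be tried offline on a small literal page. This is a minimal sketch, assuming the XML package is installed; the HTML string and the object names are hypothetical stand-ins for a downloaded page.

```r
library(XML)

# Hypothetical inline page used instead of a live website
page <- "<html><head><title>Demo</title></head>
         <body><h1>Heading</h1><p>First paragraph.</p>
         <p>Second <b>paragraph</b>.</p></body></html>"

doc   <- htmlParse(page, asText = TRUE)       # parse the string as HTML
paras <- xpathSApply(doc, "//p", xmlValue)    # text content of every <p> node
heads <- xpathSApply(doc, "//h1", xmlValue)   # text content of every <h1> node

paras
heads
```

Changing the XPath expression (for instance, "//h1" instead of "//p") selects different parts of the document, while xmlValue() strips the nested markup and keeps only the text.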


16.1.6  Web-Page  Data  Scraping 

The process of extracting data from complete web pages and storing it in a structured
data format is called scraping. However, before starting a data scrape from a
website, we need to understand the underlying HTML structure of that specific
website. Also, we have to check the terms of that website to make sure that scraping
from this site is allowed.

The R package rvest is a very good place to start “harvesting” data from
websites.

To start with, we use read_html() to store the SOCR data website in an
xml_document object.

library(rvest)
SOCR <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data")
SOCR

## {xml_document}
## <html lang="en" dir="ltr" class="client-nojs">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-SOCR_Dat ...


From the summary structure of SOCR, we can discover that there are two important
hypertext section markups, <head> and <body>. Also, notice that the SOCR data
website uses <title> and </title> tags to separate the title in the <head> section.
Let’s use html_node() to extract the title information based on this knowledge.

SOCR %>% html_node("head title") %>% html_text()

## [1] "SOCR Data - SOCR"

Here we used the %>% operator, or pipe, to connect two functions. The above line of
code creates a chain of functions that operate on the SOCR object. The first function in
the chain, html_node(), extracts the title from the head section. Then,
html_text() converts the HTML-formatted hypertext into plain text. More on R
piping can be found in the magrittr package.

Another function, rvest::html_nodes(), can be very helpful in scraping.
Similar to html_node(), html_nodes() can help us extract multiple nodes from
an XML document object. Assume that we want to obtain the meta elements (usually page
description, keywords, author of the document, last modified, and other metadata)
from the SOCR data website. We apply html_nodes() to the SOCR object to
extract the hypertext data, e.g., lines starting with <meta> in the <head> section of
the HTML page source. Optionally, we can use html_attrs(), which extracts the
attributes of the selected nodes, to obtain the main text attributes.


meta <- SOCR %>% html_nodes("head meta") %>% html_attrs()
meta

## [[1]]
##                 http-equiv                    content
##             "Content-Type" "text/html; charset=UTF-8"
##
## [[2]]
## charset
## "UTF-8"
##
## [[3]]
##        http-equiv           content
## "X-UA-Compatible"         "IE=EDGE"
##
## [[4]]
##               name            content
##        "generator" "MediaWiki 1.23.1"
##
## [[5]]
##                          name                       content
## "ResourceLoaderDynamicStyles"                            ""

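The node-selection steps above can be practiced on a literal HTML string, which avoids any dependence on a live website. This is a minimal sketch, assuming the rvest package is installed; the page content and the object names are hypothetical.

```r
library(rvest)

# Hypothetical inline page; read_html() also accepts a literal HTML string
page <- read_html('<html><head><title>Demo page</title>
  <meta charset="UTF-8"><meta name="generator" content="by hand"></head>
  <body><p>Hello</p></body></html>')

# A single node: the page title as plain text
page_title <- page %>% html_node("head title") %>% html_text()

# Multiple nodes: the attributes of every <meta> element in the head
meta_attrs <- page %>% html_nodes("head meta") %>% html_attrs()

# One attribute of one node, selected with a CSS attribute selector
generator <- page %>% html_node("meta[name=generator]") %>% html_attr("content")

page_title
meta_attrs
generator
```

html_node() returns the first match only, whereas html_nodes() returns all matches, which is why the second call yields a list with one attribute vector per meta element.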

16.1.7 Parsing JSON from Web APIs

Application Programming Interfaces (APIs) allow web-accessible functions to
communicate with each other. Today, most APIs exchange data in JSON (JavaScript Object
Notation) format.

JSON is a plain text format used to represent web application data, data structures, or
objects. Online JSON objects can be retrieved by packages like RCurl and httr.
Let’s see a JSON formatted dataset first. We can use 02_Nofl_Data.json in the class
files as an example.

library(httr)
nofl <- GET("https://umich.instructure.com/files/1760327/download?download_frd=1")
nofl

## Response [https://instructure-uploads.s3.amazonaws.com/account_17700000000000001/attachments/1760327/02_Nofl_Data.json?response-content-disposition=attachment%3B%20filename%3D%2202_Nofl_Data.json%22%3B%20filename%2A%3DUTF-8%27%2702%255FNofl%255FData.json&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJFNFXH2V2O7RPCAA%2F20170703%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20170703T190959Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=ceb3be3e71d9c370239bab558fcb0191bc829b98a7ba61ac86e27a2fc3c1e8ce]
##   Date: 2017-07-03 19:10
##   Status: 200
##   Content-Type: application/json
##   Size: 109 kB
## [{"ID":1,"Day":1,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":2,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":3,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":4,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":5,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":6,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":7,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":8,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":9,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## {"ID":1,"Day":10,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS"...
## ...

We can see that JSON objects are very simple. The data structure is organized
using hierarchies marked by square brackets (arrays) and curly braces (objects). Each
piece of information is formatted as a {key: value} pair.

The package jsonlite is a very useful tool to import online JSON formatted
datasets directly into a data frame. Its syntax is very straightforward.


# install.packages("jsonlite")
library(jsonlite)
nofl_lite <- fromJSON("https://umich.instructure.com/files/1760327/download?download_frd=1")
class(nofl_lite)

## [1] "data.frame"
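To see how an array of {key: value} records maps onto a data frame, here is a minimal offline sketch, assuming the jsonlite package is installed. The two-record JSON string is hypothetical and only mimics the layout of the dataset shown above.

```r
library(jsonlite)

# Hypothetical two-record JSON array mirroring the layout shown above
txt <- '[{"ID":1,"Day":1,"Tx":1,"SelfEff":33},
         {"ID":1,"Day":2,"Tx":1,"SelfEff":33}]'

df <- fromJSON(txt)        # an array of flat objects becomes a data frame
class(df)
df$Day                     # each key becomes a column

json_back <- toJSON(df)    # serialize the data frame back to JSON text
json_back
```

Each object in the array becomes one row, and each key becomes one column; nested objects or arrays would instead produce nested columns or lists.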


16.1.8  Reading  and  Writing  Microsoft  Excel  Spreadsheets 
Using  XLSX 

We can convert an xlsx dataset into CSV and use read.csv() to load this kind of
dataset. However, R provides an alternative, the read.xlsx() function in the package
xlsx, to simplify this process. Take our 02_Nofl_Data.xls data in the class files
as an example. We need to download the file first.


# install.packages("xlsx")
library(xlsx)
nofl <- read.xlsx("C:/Users/Folder/02_Nofl.xlsx", 1)
str(nofl)

## 'data.frame':    900 obs. of  10 variables:
##  $ ID       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Tx       : num  1 1 0 0 1 1 0 0 1 1 ...
##  $ SelfEff  : num  33 33 33 33 33 33 33 33 33 33 ...
##  $ SelfEff25: num  8 8 8 8 8 8 8 8 8 8 ...
##  $ WPSS     : num  0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
##  $ SocSuppt : num  5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
##  $ PMss     : num  4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
##  $ PMss3    : num  1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
##  $ PhyAct   : num  53 73 23 36 21 0 21 0 73 114 ...


The last argument, 1, stands for the first Excel sheet, as any Excel file may include
a large number of tables. Also, we can download the xls or xlsx file into our
R working directory so that it is easier to find the file path.

Sometimes more complex protocols may be necessary to ingest data from XLSX
documents, for instance, if the XLSX document is large, includes many tables, and is only
accessible via the HTTP protocol from a web server. Below is an example of
downloading the second table, ABIDE_Aggregated_Data, from the multi-table
Autism/ABIDE XLSX dataset:


# install.packages("openxlsx"); library(openxlsx)
tmp = tempfile(fileext = ".xlsx")
download.file(url = "https://umich.instructure.com/files/3225493/download?download_frd=1",
              destfile = tmp, mode="wb")
df_Autism <- openxlsx::read.xlsx(xlsxFile = tmp,
              sheet = "ABIDE_Aggregated_Data", skipEmptyRows = TRUE)
dim(df_Autism)

## [1] 1098 2145


16.2  Working  with  Domain-Specific  Data 

Powerful machine-learning methods have already been applied in many applications.
Some of these techniques are very specialized, and some applications require
unique approaches to address the corresponding challenges.


16.2.1  Working  with  Bioinformatics  Data 

Genetic data are stored in widely varying formats and usually have more feature
variables than observations. For example, a dataset could have 1,000 columns and
only 200 rows. One of the commonly used pre-processing steps for such datasets is
variable selection. We will talk about this in Chap. 17.

The Bioconductor project created powerful R functionality (packages and tools)
for analyzing genomic data; see Bioconductor for more detailed information.




16  Specialized  Machine  Learning  Topics 


16.2.2 Visualizing Network Data

Social network data and graph datasets describe the relations between nodes (vertices)
using connections (links or edges) joining the node objects. If we have
N objects, we can have up to N * (N - 1) directed links establishing paired associations
between the nodes. Let’s use an example with N = 4 to demonstrate a simple graph
potentially modeling the node linkage (Table 16.1).

If we change each a -> b entry to an indicator variable (0 or 1) capturing whether we
have an edge connecting a pair of nodes, then we get the graph adjacency matrix.

Edge lists provide an alternative way to represent network connections. Every
line in the list contains a connection between two nodes (objects) (Table 16.2).

The edge list in Table 16.2 lists three network connections: object 1 is linked to
object 2; object 1 is linked to object 3; and object 2 is linked to object 3. Note that
edge lists can represent both directed and undirected networks or graphs.
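
To make the two representations concrete, the following minimal sketch (base R only) converts the three-link edge list of Table 16.2 into an adjacency matrix for N = 4 nodes. The object names (edge_list, adj, adj_undirected) are illustrative choices and are not used elsewhere in the chapter.

```r
N <- 4
# The three links listed in Table 16.2, one (source, target) pair per row
edge_list <- rbind(c(1, 2), c(1, 3), c(2, 3))

# Start from an all-zero N x N matrix and flag the listed links with 1
adj <- matrix(0, nrow = N, ncol = N, dimnames = list(1:N, 1:N))
adj[edge_list] <- 1                   # directed reading: row = source, column = target

# For an undirected graph, a link a-b implies b-a, so we symmetrize
adj_undirected <- pmax(adj, t(adj))

adj
N * (N - 1)                           # maximum number of directed links
```

Indexing a matrix by a two-column matrix of (row, column) pairs is what allows the whole edge list to be written into the adjacency matrix in a single assignment.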

If N is very large, as in social networks, the data representation
and analysis may be resource-intensive (memory or computation). In R, we have
multiple packages that can deal with social network data. One user-friendly example
is the igraph package. First, let’s build a toy example and visualize
it using this package (Fig. 16.1).

# install.packages("igraph")
library(igraph)

g <- graph(c(1, 2, 1, 3, 2, 3, 3, 4), n = 10)
plot(g)

Here c(1, 2, 1, 3, 2, 3, 3, 4) is an edge list with 4 rows (each consecutive pair
of values defines one edge) and n=10 indicates that we have 10 nodes (objects) in
total. The small arrows in the graph show the directed network connections. Notice
that nodes 5-10 are scattered in the graph. This is because they are not included
in the edge list, so there are no network connections between them and the rest of
the network.
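
As a quick check of how the flat vector is interpreted, this small base R sketch reshapes it into an explicit two-column edge list and identifies the unconnected nodes. The names edges, edge_mat, and isolated are hypothetical helpers introduced only for this illustration.

```r
edges <- c(1, 2, 1, 3, 2, 3, 3, 4)

# Consecutive pairs of values form the rows (edges) of the edge list
edge_mat <- matrix(edges, ncol = 2, byrow = TRUE)
edge_mat

# Nodes among 1..10 that never appear in any edge are isolated
isolated <- setdiff(1:10, unique(edges))
isolated
```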


Table 16.1 Schematic matrix representation of network connectivity

Objects   1         2         3         4
1                   1 -> 2    1 -> 3    1 -> 4
2         2 -> 1              2 -> 3    2 -> 4
3         3 -> 1    3 -> 2              3 -> 4
4         4 -> 1    4 -> 2    4 -> 3

Table 16.2 List-based representation of network connectivity

Vertex    Vertex
1         2
1         3
2         3

Fig. 16.1 A simple example of a social network as a graph object


Now  let’s  examine  the  co-appearance  network  of  Facebook  circles.  The  data 
contains  anonymized  circles  (friends  lists)  from  Facebook  collected  from  survey 
participants  using  a  Facebook  app.  The  dataset  only  includes  edges  (circles,  88,234) 
connecting  pairs  of  nodes  (users,  4,039)  in  the  member  social  networks. 

The values on the connections represent the number of links/edges within a circle.
We have a huge edge list made of scrambled Facebook user IDs. Let’s load this
dataset into R first. The data is stored in a text file. Unlike CSV files, text files in table
format need to be imported using read.table(). We are using the header=F
option to let R know that we don’t have a header in the text file, which contains only
space-separated node pairs (indicating the social connections, edges, between
Facebook users).


soc.net.data <- read.table("https://umich.instructure.com/files/2854431/download?download_frd=1",
                           sep=" ", header=F)
head(soc.net.data)

##   V1 V2
## 1  0  1
## 2  0  2
## 3  0  3
## 4  0  4
## 5  0  5
## 6  0  6

Now the data is stored in a data frame. To make this dataset ready for igraph
processing and visualization, we need to convert soc.net.data into a matrix
object.


soc.net.data.mat <- as.matrix(soc.net.data, ncol=2)

Because soc.net.data has two columns, this yields a matrix with two columns
(as.matrix() does not reshape the data, so ncol=2 only documents the expected
shape). The data is now ready and we can apply graph.edgelist().

# remove the first 347 edges (to wipe out the degenerate "0" node)
graph_m <- graph.edgelist(soc.net.data.mat[-c(0:347), ], directed = F)

Before  we  display  the  social  network  graph  we  may  want  to  examine  our  model 
first. 

summary(graph_m)

## IGRAPH U--- 4038 87887 --

This is an extremely brief yet informative summary. The first line, U--- 4038
87887, includes potentially four letters and two numbers. The first letter could be U
or D, indicating undirected or directed edges. A second letter N would mean that the
vertex set has a “name” attribute. A third letter is for a weighted (W) graph. Since we
didn’t add weights in our analysis, the third letter is empty (“-”). A fourth character is
an indicator for bipartite graphs (B), whose vertices can be divided into two disjoint
sets such that every edge connects a vertex from one set to a vertex in the other set.
The two numbers following the four letters represent the number of nodes and the
number of edges, respectively. Now let’s render the graph (Fig. 16.2).

plot(graph_m)

This graph is very complicated. We can still see that some nodes are surrounded
by more nodes than others. To obtain such information we can use the degree()
function, which lists the number of edges for each node.

degree(graph_m)

Skimming the table, we find that the 107-th user has as many as 1,044
connections, which makes this user a highly connected hub. This node likely
has higher social relevance.

Some nodes might be more important than others because they serve as a
bridge linking a cloud of nodes. To compare their importance, we can use the
betweenness centrality measure, which counts how many of the shortest paths
between pairs of other nodes pass through a given node. High betweenness
centrality for a specific node indicates influence. The betweenness() function
can help us calculate this measure.
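
Before applying these measures to the large Facebook graph, it may help to see them on a graph small enough to verify by hand. The sketch below assumes the igraph package is installed and uses a hypothetical four-node path graph (g_toy), which is not part of the Facebook data.

```r
library(igraph)

# Toy undirected path graph: 1 - 2 - 3 - 4 (illustrative only)
g_toy <- make_graph(c(1, 2, 2, 3, 3, 4), directed = FALSE)

# Degree: number of edges incident to each node
toy_degree <- degree(g_toy)

# Betweenness: number of shortest paths between other node pairs
# that pass through each node
toy_betweenness <- betweenness(g_toy)

toy_degree
toy_betweenness
```

The two end nodes lie on no shortest path between other nodes, whereas each interior node bridges two node pairs, for example node 2 lies on the shortest paths 1-3 and 1-4.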

betweenness(graph_m)

Again, the 107-th node has the highest betweenness centrality
(3.556221e+06).

Fig. 16.2 Social network connectivity of Facebook users

Fig. 16.3 Live demo: a dynamic graph representation of the SOCR resources


We can try another example using SOCR hierarchical data, which is also available
for dynamic exploration as a tree graph. Let’s read its JSON data source using
the jsonlite package (Fig. 16.3).

library(jsonlite)
tree.json <- fromJSON("http://socr.ucla.edu/SOCR_HyperTree.json",
                      simplifyDataFrame = FALSE)

This generates a list object representing the hierarchical structure of the
network. Note that this is quite different from an edge list. There is one root node,
its sub-nodes are called children nodes, and the terminal nodes are called leaf nodes.
Instead of presenting the relationship between nodes in pairs, this hierarchical
structure captures the level for each node. To draw the social network graph, we
need to convert it to a Node object. We can utilize the as.Node() function in the
data.tree package to do so.
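
To illustrate how a nested (hierarchical) list differs from an edge list, the following base R sketch walks a small made-up hierarchy and emits one parent-child pair per link. The list tree and the helper to_edges() are hypothetical; they only mimic the name/children layout and are not the SOCR data or the data.tree implementation.

```r
# Hypothetical nested list with "name" and "children" fields
tree <- list(name = "root", children = list(
  list(name = "A", children = list(list(name = "A1"), list(name = "A2"))),
  list(name = "B")
))

# Recursively visit each node and record a (parent, child) row per link;
# leaf nodes have no children, so the recursion stops there
to_edges <- function(node) {
  edges <- NULL
  for (child in node$children) {
    edges <- rbind(edges, c(node$name, child$name), to_edges(child))
  }
  edges
}

edge_list <- to_edges(tree)
edge_list
```

The nested form stores the level of every node implicitly through the nesting depth, while the flattened edge list keeps only the pairwise parent-child links.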

# install.packages("data.tree")
library(data.tree)

tree.graph <- as.Node(tree.json, mode = "explicit")

Here we use the mode="explicit" option to allow “children” nodes to have
their own “children” nodes. Now, the tree.json object has been separated into
four different node structures - "About SOCR", "SOCR Resources", "Get
Started", and "SOCR Wiki". Let’s plot the first one using the igraph package
(Fig. 16.4).


Fig. 16.4 The SOCR resourceome network plotted as a static R graph

plot(as.igraph(tree.graph$`About SOCR`), edge.arrow.size=5, edge.label.font=0.05)

In  this  graph,  "About  SOCR" ,  which  is  located  at  the  center,  represents  the  root 
node  of  the  tree  graph. 


16.3  Data  Streaming 

The proliferation of Cloud services and the emergence of modern technology in all
aspects of human experience lead to a tsunami of data, much of which is streamed
in real time. The interrogation of such voluminous data is an increasingly important
area of research. Data streams are ordered, often unbounded sequences of data
points created continuously by a data generator. All of the data mining, interrogation
and forecasting methods we discuss here are also applicable to data streams.


16.3.1  Definition 

Mathematically, a data stream is an ordered sequence of data points

$$Y = \{y_1, y_2, y_3, \ldots, y_t, \ldots\},$$

where the (time) index, t, reflects the order of the observation/record, which may be
single numbers, simple vectors in multidimensional space, or objects, e.g., structured
Ann Arbor Weather (JSON) and its corresponding structured form. Some streaming
data is streamed because it’s too large to be downloaded all at once (shotgun style)
and some is streamed because it’s continually generated and serviced. This presents
the potential problem of dealing with data streams that may be unlimited.
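
One common way to cope with a potentially unlimited stream is to process it in batches and keep only a small running summary instead of the full history. The base R sketch below shows this idea for the mean of a simulated stream; the generator next_batch() and all parameter values are assumptions made for illustration, and all_points is stored only so that the result can be verified.

```r
set.seed(1)

# Hypothetical generator: each call returns the next n points of the stream
next_batch <- function(n) rnorm(n, mean = 5, sd = 2)

count <- 0            # number of points seen so far
running_mean <- 0     # summary that is updated incrementally
all_points <- numeric(0)

for (b in 1:20) {
  y <- next_batch(50)
  all_points <- c(all_points, y)   # kept only for the final check
  # Incremental update: new_mean = old_mean + (sum(y) - n * old_mean) / new_count
  count_new <- count + length(y)
  running_mean <- running_mean + (sum(y) - length(y) * running_mean) / count_new
  count <- count_new
}

c(running = running_mean, batch = mean(all_points))
```

The incremental estimate agrees with the mean computed over all stored points, but in a real streaming setting the stored copy would be dropped and only count and running_mean retained.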

Notes: 

• Data sources: Real or synthetic stream data can be used. Random simulation
streams may be created by rstream. Real stream data may be piped from
financial data providers, the WHO, World Bank, NCAR and other sources.

• Inference techniques: Many of the data interrogation techniques we have seen
can be employed for dynamic stream data, e.g., factas for PCA, rEMM and
birch for clustering, etc. Clustering and classification methods capable of
processing data streams have been developed, e.g., Very Fast Decision Trees
(VFDT), time window-based Online Information Network (OLIN), On-demand
Classification, and the APRIORI streaming algorithm.

• Cloud distributed computing: Hadoop/HadoopStreaming, SPARK, and Storm/
RStorm provide environments to expand batch/script-based R tools to the
Cloud.


16.3.2 The stream Package

The R stream package provides data stream mining algorithms using fpc, clue,
cluster, clusterGeneration, MASS, and proxy packages. In addition, the
package streamMOA provides an rJava interface to the Java-based data stream
clustering algorithms available in the Massive Online Analysis (MOA) framework
for stream classification, regression and clustering.

If  you  need  a  deeper  exposure  to  data  streaming  in  R,  we  recommend  you  go  over 
the  stream  vignettes. 


16.3.3  Synthetic  Example:  Random  Gaussian  Stream 

This example shows the creation and loading of a mixture of 5 random 2D Gaussians,
centered at (x_coords, y_coords) with paired correlations rho_corr, representing
a simulated data stream.

Generate  the  stream: 

# install.packages("stream")
library("stream")

x_coords <- c(0.2, 0.3, 0.5, 0.8, 0.9)
y_coords <- c(0.8, 0.3, 0.7, 0.1, 0.5)
# A vector of probabilities that determines the likelihood of generating
# a data point from a particular cluster
p_weight <- c(0.1, 0.9, 0.5, 0.4, 0.3)
set.seed(12345)
stream_5G <- DSD_Gaussians(k = 5, d = 2, mu = cbind(x_coords, y_coords),
                           p = p_weight)


k-Means  Clustering 

We will now try k-means and the density-based data stream clustering algorithm
D-Stream, where micro-clusters are formed by grid cells of size gridsize and the
density of a dense grid cell is at least Cm = 1.2 times the average cell density. The
model is updated with the next 500 data points from the stream.

dstream <- DSC_DStream(gridsize = .1, Cm = 1.2)
update(dstream, stream_5G, n = 500)
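
The role of gridsize and Cm can be illustrated with a deliberately simplified, static sketch in base R: points are binned into square grid cells and a cell is flagged as dense when its count is at least Cm times the average count per cell. This is only an illustration of the thresholding idea under assumed uniform toy data (pts); the actual D-Stream algorithm in the stream package additionally handles decay over time and incremental updates, which are omitted here.

```r
set.seed(12345)
pts <- cbind(x = runif(500), y = runif(500))   # hypothetical points in the unit square
gridsize <- 0.1
Cm <- 1.2

# Integer grid coordinates of the cell containing each point
cell <- floor(pts / gridsize)
cell_id <- paste(cell[, "x"], cell[, "y"], sep = "_")

# Number of points falling in each occupied cell
counts <- table(cell_id)

# Average number of points per cell over the whole 10 x 10 grid
avg_density <- nrow(pts) / (1 / gridsize)^2

# Cells whose count reaches Cm times the average are treated as dense
dense_cells <- names(counts)[as.vector(counts) >= Cm * avg_density]
length(dense_cells)
```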

First, let’s run the k-means clustering with k = 5 clusters and plot the resulting
micro- and macro-clusters (Fig. 16.5).

kmc <- DSC_Kmeans(k = 5)
recluster(kmc, dstream)
plot(kmc, stream_5G, type = "both", xlab = "X-axis", ylab = "Y-axis")

Fig. 16.5 Micro- and macro-clusters of a 5-means clustering of the first 500 points of the streamed
simulated 2D Gaussian kernels

In this clustering plot, micro-clusters are shown as circles and macro-clusters are
shown as crosses; their sizes represent the corresponding cluster weight
estimates.

Next, try the density-based data stream clustering algorithm D-Stream. Prior to
updating the model with the next 1,000 data points from the stream, we specify the
grid cells as micro-clusters, the grid cell size (gridsize=0.1), and a micro-cluster
density threshold (Cm=1.2) that specifies the density of a grid cell as a multiple of
the average cell density.


dstream <- DSC_DStream(gridsize = 0.1, Cm = 1.2)
update(dstream, stream_5G, n = 1000)

We can re-cluster the data using k-means with 5 clusters and plot the resulting
micro- and macro-clusters (Fig. 16.6).


km_G5 <- DSC_Kmeans(k = 5)
recluster(km_G5, dstream)
plot(km_G5, stream_5G, type = "both")

Note the subtle changes in the clustering results between kmc and km_G5.

Fig. 16.6 Micro- and macro-clusters of a 5-means clustering of the next 1,000 points of the
streamed simulated 2D Gaussian kernels


16.3.4  Sources  of  Data  Streams 

Static  Structure  Streams 

•  DSD_BarsAndGaussians  generates  two  uniformly  filled  rectangular  and  two 
Gaussian  clusters  with  different  density. 

• DSD_Gaussians generates randomly placed static clusters with random
multivariate Gaussian distributions.

•  DSD_mlbenchData  provides  streaming  access  to  machine  learning  benchmark 
data  sets  found  in  the  mlbench  package. 

•  DSD_mlbenchGenerator  interfaces  the  generators  for  artificial  data  sets  defined 
in  the  mlbench  package. 

•  DSD_Target  generates  a  ball  in  circle  data  set. 

•  DSD_UniformNoise  generates  uniform  noise  in  a  d-dimensional  (hyper)  cube. 


Concept  Drift  Streams 

• DSD_Benchmark provides a collection of simple benchmark problems including
splitting and joining clusters, and changes in density or size, which can be used as
a comprehensive benchmark set for algorithm comparison.

•  DSD_MG  is  a  generator  to  specify  complex  data  streams  with  concept  drift.  The 
shape  as  well  as  the  behavior  of  each  cluster  over  time  can  be  specified  using 
keyframes. 

• DSD_RandomRBFGeneratorEvents generates streams using radial base functions
with noise. Clusters move, merge and split.


Real Data Streams

• DSD_Memory provides a streaming interface to static, matrix-like data (e.g., a
data frame, a matrix) in memory which represents a fixed portion of a data stream.
Matrix-like objects also include large objects potentially stored on disk like
ff::ffdf.

•  DSD_ReadCSV  reads  data  line  by  line  in  text  format  from  a  file  or  an  open 
connection  and  makes  it  available  in  a  streaming  fashion.  This  way  data  that  is 
larger  than  the  available  main  memory  can  be  processed. 

•  DSD_ReadDB  provides  an  interface  to  an  open  result  set  from  a  SQL  query  to  a 
relational  database. 


16.3.5  Printing,  Plotting  and  Saving  Streams 

For DSD objects, some basic stream functions include print(), plot(), and
write_stream(). The latter can save part of a data stream to disk. DSD_Memory
and DSD_ReadCSV objects also include member functions like reset_stream()
to reset the position in the stream to its beginning.

To request a new batch of data points from the stream we use get_points().
For the Gaussian stream above, each request chooses a random cluster (based on
the probability weights in p_weight) and draws a point from the multivariate
Gaussian distribution (mean = mu, covariance matrix = Σ) of that cluster. Below,
we pull n = 10 new data points from the stream (Fig. 16.7).
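
The two-step sampling mechanism described above (pick a cluster according to its weight, then draw a point from that cluster's Gaussian) can be mimicked in base R. This is a rough sketch, not the stream package's implementation: it assumes a common, isotropic standard deviation (sigma), whereas DSD_Gaussians generates its own covariance matrices, and draw_points() is a hypothetical helper.

```r
set.seed(12345)
mu <- cbind(x_coords = c(0.2, 0.3, 0.5, 0.8, 0.9),
            y_coords = c(0.8, 0.3, 0.7, 0.1, 0.5))
p_weight <- c(0.1, 0.9, 0.5, 0.4, 0.3)
sigma <- 0.05   # assumed common standard deviation (illustrative)

draw_points <- function(n) {
  # Step 1: choose a cluster index for each point; sample() rescales the
  # weights, so they do not need to sum to one
  k <- sample(nrow(mu), size = n, replace = TRUE, prob = p_weight)
  # Step 2: draw each coordinate around the chosen cluster center
  data.frame(X1 = rnorm(n, mean = mu[k, 1], sd = sigma),
             X2 = rnorm(n, mean = mu[k, 2], sd = sigma),
             class = k)
}

new_pts <- draw_points(10)
new_pts
```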


new_p <- get_points(stream_5G, n = 10)
new_p

##           X1        X2
## 1  0.4017803 0.2999017
## 2  0.4606262 0.5797737
## 3  0.4611642 0.6617809
## 4  0.3369141 0.2840991
## 5  0.8928082 0.5687830
## 6  0.8706420 0.4282589
## 7  0.2539396 0.2783683
## 8  0.5594320 0.7019670
## 9  0.5030676 0.7560124
## 10 0.7930719 0.0937701

new_p <- get_points(stream_5G, n = 100, class = TRUE)
head(new_p, n = 20)

##           X1         X2 class
## 1  0.7915730 0.09533001     4
## 2  0.4305147 0.36953997     2
## 3  0.4914093 0.82120395     3
## 4  0.7837102 0.06771246     4
## 5  0.9233074 0.48164544     5
## 6  0.8606862 0.49399269     5
## 7  0.3191884 0.27607324     2
## 8  0.2528981 0.27596700     2
## 9  0.6627604 0.68988585     3
## 10 0.7902887 0.09402659     4
## 11 0.7926677 0.09030248     4
## 12 0.9393515 0.50259344     5
## 13 0.9333770 0.62817482     5
## 14 0.7906710 0.10125432     4
## 15 0.1798662 0.24967850     2
## 16 0.7985790 0.08324688     4
## 17 0.5247573 0.57527380     3
## 18 0.2358468 0.23087585     2
## 19 0.8818853 0.49668824     5
## 20 0.4255094 0.81789418     3

plot(stream_5G, n = 700, method = "pc")

Fig. 16.7 Scatterplot of the next batch of 700 random Gaussian points in 2D


Note that if you add noise to your stream, e.g., stream_Noise <-
DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3,
0.9, 0.1)), then the noise points that are not classified as part of any cluster will
have an NA class label.


16.3.6  Stream  Animation 

Clusters can be animated over time by animate_data(). Use reset_stream()
to start the animation at the beginning of the stream and note that this method is
not implemented for streams of class DSD_Gaussians, DSD_R, DSD_data.frame,
and DSD. We’ll create a new DSD_Benchmark data stream (Fig. 16.8).

Fig. 16.8 Discrete snapshots of the animated stream clustering process


set.seed(12345)
stream_Bench <- DSD_Benchmark(1)
stream_Bench

## Benchmark 1: Two clusters moving diagonally from left to right, meeting in
## the center (5% noise).
## Class: DSD_MG, DSD_R, DSD_data.frame, DSD
## With 2 clusters in 2 dimensions. Time is 1

library("animation")
reset_stream(stream_Bench)
animate_data(stream_Bench, n=10000, horizon=100, xlim=c(0, 1), ylim=c(0, 1))

This benchmark generator creates two clusters moving in 2D. One moves
from top-left to bottom-right, the other from bottom-left to top-right. When they meet
at the center of the domain, the two clusters overlap and then split again.

Concept drift in the stream can be depicted by requesting 10 times 300 data
points from the stream and animating the plot. Fast-forwarding the stream can be
accomplished by requesting, but ignoring, 2,000 points in between the 10 plots.
The output of the animation below is suppressed to save space.


for(i in 1:10) {
  plot(stream_Bench, 300, xlim = c(0, 1), ylim = c(0, 1))
  tmp <- get_points(stream_Bench, n = 2000)
}

reset_stream(stream_Bench)
animate_data(stream_Bench, n=8000, horizon=120, xlim=c(0, 1), ylim=c(0, 1))

# Animations can be saved as HTML or GIF
# saveHTML(ani.replay(), htmlfile = "stream_Bench_Animation.html")
# saveGIF(ani.replay())

Streams can also be saved locally by write_stream(stream_Bench,
"dataStreamSaved.csv", n=100, sep=",") and loaded back in R by
DSD_ReadCSV().
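
The idea of reading a saved stream back in small portions, rather than loading the whole file, can be sketched with base R connections alone. The example below is independent of the stream package: it writes a small hypothetical CSV file to a temporary location (tmp_csv) and then consumes it in batches of 10 lines from an open connection, which remembers its position between reads.

```r
set.seed(2)

# Write a small hypothetical two-column CSV file without a header
tmp_csv <- tempfile(fileext = ".csv")
write.table(data.frame(x = runif(25), y = runif(25)), tmp_csv,
            sep = ",", row.names = FALSE, col.names = FALSE)

con <- file(tmp_csv, open = "r")   # open connection keeps track of the read position
n_read <- 0
batch_sizes <- integer(0)

repeat {
  lines <- readLines(con, n = 10)            # next batch of up to 10 rows
  if (length(lines) == 0) break              # end of file reached
  batch <- read.csv(text = lines, header = FALSE, col.names = c("x", "y"))
  batch_sizes <- c(batch_sizes, nrow(batch))
  n_read <- n_read + nrow(batch)
}
close(con)

batch_sizes
```

Only one batch is held in memory at a time, which is what makes this pattern suitable for files larger than the available memory.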


16.3.7  Case-Study:  SOCR  Knee  Pain  Data 

These data represent the X and Y spatial knee-pain locations for over 8,000 patients,
along with labels about the knee Front, Back, Left and Right. Let’s try to read the
SOCR Knee Pain Dataset as a stream.

library("XML"); library("xml2"); library("rvest")

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_KneePainData_041409")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top

kneeRawData <- html_table(html_nodes(wiki_url, "table")[[2]])
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
kneeRawData_df <- as.data.frame(cbind(normalize(kneeRawData$x),
    normalize(kneeRawData$Y), as.factor(kneeRawData$View)))
colnames(kneeRawData_df) <- c("X", "Y", "Label")
# randomize the rows of the DF as RF, RB, LF and LB labels of classes are sequential
set.seed(1234)
kneeRawData_df <- kneeRawData_df[sample(nrow(kneeRawData_df)), ]
summary(kneeRawData_df)


##        X                Y              Label
##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000
##  1st Qu.:0.1331   1st Qu.:0.4566   1st Qu.:2.000
##  Median :0.2995   Median :0.5087   Median :3.000
##  Mean   :0.3382   Mean   :0.5091   Mean   :2.801
##  3rd Qu.:0.3645   3rd Qu.:0.5549   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :4.000

# View(kneeRawData_df)
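
Two details of the preprocessing above are worth checking on a tiny example: min-max normalization rescales each coordinate to the interval [0, 1], and combining a factor with numeric columns via cbind() replaces the factor by its integer codes, which is why the Label column is summarized as a number between 1 and 4. The data frame toy below is hypothetical and only mimics the structure of the knee pain table.

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical miniature dataset with a coordinate and a view label
toy <- data.frame(x = c(10, 20, 40), View = c("RF", "LB", "RF"))

# cbind() of a numeric vector and a factor yields a numeric matrix:
# the factor contributes its internal integer codes (levels sorted: LB = 1, RF = 2)
m <- cbind(normalize(toy$x), as.factor(toy$View))
m
```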

We can use the DSD_Memory class to get a stream interface for matrix or
data frame objects, like the knee pain location dataset. The number of true clusters
in this dataset is k = 4.

# use data.frame to create a stream (3rd column contains label assignment)
kneeDF <- data.frame(x=kneeRawData_df[,1], y=kneeRawData_df[,2],
                     class=as.factor(kneeRawData_df[,3]))
head(kneeDF)

##           x         y class
## 1 0.1188590 0.5057803     4
## 2 0.3248811 0.6040462     2
## 3 0.3153724 0.4971098     2
## 4 0.3248811 0.4161850     2
## 5 0.6941363 0.5289017     1
## 6 0.3217116 0.4595376     2

streamKnee <- DSD_Memory(kneeDF[, c("x", "y")], class=kneeDF[, "class"],
                         loop=T)
streamKnee


## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 1 - loop is TRUE

# Each time we get a point from *streamKnee*, the stream pointer moves
# to the next position (row) in the data.
get_points(streamKnee, n=10)

##             x         y
## 1  0.11885895 0.5057803
## 2  0.32488114 0.6040462
## 3  0.31537242 0.4971098
## 4  0.32488114 0.4161850
## 5  0.69413629 0.5289017
## 6  0.32171157 0.4595376
## 7  0.06497623 0.4913295
## 8  0.12519810 0.4682081
## 9  0.32329635 0.4942197
## 10 0.30744849 0.5086705

streamKnee

## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 11 - loop is TRUE

# Stream pointer is in position 11 now

# We can redirect the current position of the stream pointer by:
reset_stream(streamKnee, pos = 200)
get_points(streamKnee, n=10)

##             x         y
## 200 0.9413629 0.5606936
## 201 0.3217116 0.5664740
## 202 0.3122029 0.6416185
## 203 0.1553090 0.6040462
## 204 0.3645008 0.5346821
## 205 0.3122029 0.5000000
## 206 0.3549921 0.5404624
## 207 0.1473851 0.5260116
## 208 0.1870048 0.6329480
## 209 0.1220285 0.4132948

streamKnee

## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 210 - loop is TRUE


16.3.8  Data  Stream  Clustering  and  Classification  (DSC) 

Let’s demonstrate clustering using DSC_DStream, which assigns points to cells in
a grid. First, initialize the clustering as an empty cluster and then use the update()
function to implicitly alter the mutable DSC object (Fig. 16.9).

dsc_streamKnee <- DSC_DStream(gridsize = 0.1, Cm = 0.4, attraction = TRUE)
dsc_streamKnee

## DStream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 0
## Number of macro-clusters: 0

# stream::update
reset_stream(streamKnee, pos = 1)
update(dsc_streamKnee, streamKnee, n = 500)
dsc_streamKnee

## DStream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 16
## Number of macro-clusters: 11


head(get_centers(dsc_streamKnee))

##      [,1] [,2]
## [1,] 0.05 0.45
## [2,] 0.05 0.55
## [3,] 0.15 0.35
## [4,] 0.15 0.45
## [5,] 0.15 0.55
## [6,] 0.15 0.65

plot(dsc_streamKnee, streamKnee, xlim=c(0,1), ylim=c(0,1))

# plot(dsc_streamKnee, streamKnee, grid = TRUE)
# Micro-clusters are plotted in red on top of gray stream data points
# The size of the micro-clusters indicates their weight - it's proportional
# to the number of data points represented by each micro-cluster.
# Micro-clusters are shown as dense grid cells (density is coded with gray
# values).

Fig. 16.9 Data stream clustering and classification of the SOCR knee-pain dataset (n=500)

Fig. 16.10 5-Means stream clustering of the SOCR knee pain data

The purity metric represents an external evaluation criterion of cluster
quality, which is the proportion of the total number of points that were correctly
classified:

$$0 \le Purity = \frac{1}{N}\sum_{i=1}^{k} \max_j |c_i \cap t_j| \le 1,$$

where N = number of observed data points, k = number of clusters, c_i is the i-th cluster,
and t_j is the classification that has the maximum number of points with c_i class labels.
High purity suggests that we correctly label points (Fig. 16.10).

Next, we can use k-means clustering.


kMeans_Knee <- DSC_Kmeans(k=5)   # use 4-5 clusters matching the 4 knee labels
recluster(kMeans_Knee, dsc_streamKnee)
plot(kMeans_Knee, streamKnee, type = "both")

Again, the graphical output of the animation sequence of frames is suppressed;
however, readers are encouraged to run the commands and inspect the
graphical outcome (Figs. 16.11 and 16.12).


animate_data(streamKnee, n=1000, horizon=100, xlim=c(0,1), ylim=c(0,1))

# purity <- animate_cluster(kMeans_Knee, streamKnee, n=2500, type="both",
# xlim=c(0,1), ylim=c(-,1), evaluationMeasure="purity", horizon=10)

animate_cluster(kMeans_Knee, streamKnee, horizon = 100, n = 5000,
    measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))

##    points    purity
## 1       1 0.9600000
## 2     101 0.9043478
## 3     201 0.9500000
## ...
## 49   4801 0.9047619
## 50   4901 0.8850000

Fig. 16.11 Animated continuous 5-means stream clustering of the knee pain data

Fig. 16.12 Continuous stream clustering and purity index across iterations


16.3.9  Evaluation  of  Data  Stream  Clustering 


Figure 16.13 shows the average clustering purity as we evaluate the stream clustering
across the streaming points.
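
For readers who want to see how a purity value arises, the following base R sketch computes it directly from a vector of cluster assignments and a vector of true class labels: each cluster is credited with the count of its most frequent class, and the credits are summed and divided by the number of points. The helper purity() and the toy label vectors are hypothetical and are not the stream package's internal implementation.

```r
# Purity: sum over clusters of the largest class count, divided by N
purity <- function(clusters, classes) {
  tab <- table(clusters, classes)        # rows = clusters, columns = true classes
  sum(apply(tab, 1, max)) / length(clusters)
}

# Hypothetical assignments for 8 points in 3 clusters
clusters <- c(1, 1, 1, 2, 2, 2, 3, 3)
classes  <- c("a", "a", "b", "b", "b", "b", "c", "a")

toy_purity <- purity(clusters, classes)
toy_purity
```

Here cluster 1 contributes 2 points (class a), cluster 2 contributes 3 (class b), and cluster 3 contributes 1, giving (2 + 3 + 1) / 8 = 0.75.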

# Synthetic Gaussian example
# stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
# dstream <- DSC_DStream(gridsize = .1)
# update(dstream, stream, n = 2000)
# evaluate(dstream, stream, n = 100)

evaluate(dsc_streamKnee, streamKnee, measure = c("crand", "SSQ",
    "silhouette"), n = 100, type = c("auto", "micro", "macro"), assign = "micro",
    assignmentMethod = c("auto", "model", "nn"), noise = c("class", "exclude"))

## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
##      cRand        SSQ silhouette
##  0.3473634  0.3382900  0.1373143

clusterEval <- evaluate_cluster(dsc_streamKnee, streamKnee, measure =
    c("numMicroClusters", "purity"), n = 5000, horizon = 100)
head(clusterEval)

##   points numMicroClusters    purity
## 1      0               16 0.9555556
## 2    100               17 0.9733333
## 3    200               18 0.9671053
## 4    300               21 0.9687500
## 5    400               21 0.9880952
## 6    500               22 0.9750000

plot(clusterEval[ , "points"], clusterEval[ , "purity"], type = "l",
     ylab = "Avg Purity", xlab = "Points")

Fig. 16.13 Average clustering purity

animate_cluster(dsc_streamKnee, streamKnee, horizon = 100, n = 5000,
    measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))

##    points    purity
## 1       1 0.9714286
## 2     101 0.9833333
## 3     201 0.9722222
## ...
## 49   4801 0.9772727
## 50   4901 0.9777778

The dsc_streamKnee object represents the result of the stream clustering, and n is
the number of data points drawn from the streamKnee stream. The evaluation measures
can be specified as a vector of character strings. Points are assigned to clusters in
dsc_streamKnee using get_assignment(), and these assignments can be used to assess
the quality of the classification. By default, points are assigned to micro-clusters;
they can instead be assigned to macro-cluster centers by setting assign = "macro".
Also, new points can be assigned to clusters either by the rule used in the clustering
algorithm (assignmentMethod = "model") or by nearest-neighbor assignment
(assignmentMethod = "nn"), Fig. 16.14.
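To make the purity measure more concrete, the following is a minimal base-R sketch of the standard purity definition (the share of points that belong to the majority class of their cluster). It is not the internal code of the stream package; the helper cluster_purity() and the toy vectors are hypothetical and serve only as an illustration.

```r
# Purity: for each cluster take the count of its most frequent true class,
# sum these counts over all clusters, and divide by the number of points.
# cluster_purity() is a hypothetical helper, not part of the stream package.
cluster_purity <- function(cluster, truth) {
  stopifnot(length(cluster) == length(truth))
  tab <- table(cluster, truth)            # clusters in rows, classes in columns
  sum(apply(tab, 1, max)) / length(truth)
}

# Toy data (illustrative only): 3 clusters, 2 true classes
cluster <- c(1, 1, 1, 2, 2, 2, 3, 3)
truth   <- c("a", "a", "b", "b", "b", "b", "a", "b")

purity_value <- cluster_purity(cluster, truth)
print(purity_value)
```

A purity close to 1 indicates that most clusters contain points from a single class, which is how the values in the tables above can be read.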


16.4  Optimization  and  Improving  the  Computational 
Performance 

Here and in previous chapters, e.g., Chap. 15, we notice that R may sometimes be
slow and memory-inefficient. These problems may be severe, especially for
datasets with millions of records or when using complex functions. There are
packages for processing large datasets and memory optimization, e.g., bigmemory,
biganalytics, and bigtabulate.

Fig. 16.14 Continuous k-means stream clustering with classification purity


16.4.1  Generalizing  Tabular  Data  Structures  with  dplyr 


We have also seen long execution times when running processes that ingest, store, or
manipulate huge data.frame objects. The dplyr package, created by Hadley
Wickham and Romain François, provides a faster route to manage such large datasets
in R. It creates an object called tbl, similar to data.frame, which has an
in-memory column-like structure. R reads these objects a lot faster than data frames.

To make a tbl object we can either convert an existing data frame to tbl or
connect to an external database. Converting from data frame to tbl is quite easy. All
we need to do is call the function as.tbl(). (In recent dplyr releases, as.tbl() is
deprecated in favor of as_tibble().)


# install.packages("dplyr")
library(dplyr)

nof1_tbl <- as.tbl(nof1); nof1_tbl


## # A tibble: 900 x 10
##       ID   Day    Tx SelfEff SelfEff25  WPSS SocSuppt  PMss PMss3 PhyAct
##    <dbl> <dbl> <dbl>   <dbl>     <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
##  1     1     1     1      33         8  0.97     5.00  4.03  1.03     53
##  2     1     2     1      33         8 -0.17     3.87  4.03  1.03     73
##  3     1     3     0      33         8  0.81     4.84  4.03  1.03     23

##  8     1     8     0      33         8 -0.34     3.69  4.03  1.03      0
##  9     1     9     1      33         8 -0.74     3.29  4.03  1.03     73
## 10     1    10     1      33         8 -0.38     3.66  4.03  1.03    114
## # ... with 890 more rows

This looks like a normal data frame. If you are using RStudio, displaying the
nof1_tbl will show the same output as nof1.


16.4.2 Making Data Frames Faster with data.table

Similar to tbl, the data.table package provides another alternative to data
frame object representation. data.table objects are processed in R much faster
compared to standard data frames. Also, all of the functions that can accept data
frames could be applied to data.table objects as well. The function fread() is
able to read a local CSV file directly into a data.table.

# install.packages("data.table")

library(data.table)

nof1 <- fread("C:/Users/Dinov/Desktop/02_Nof1_Data.csv")

Another amazing property of data.table is that we can use subscripts to
access a specific location in the dataset just like dataset[row, column]. It also
allows the selection of rows with Boolean expressions and direct application of
functions to those selected rows. Note that column names can be used directly to
refer to a specific column in a data.table, whereas with data frames, we have to use
the dataset$columnName syntax.

nof1[ID==1, mean(PhyAct)]

## [1] 52.66667

This useful functionality can also help us run complex operations with only a few
lines of code. One of the drawbacks of using data.table objects is that they are
still limited by the available system memory.
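The difference between the two subsetting styles can be illustrated with a small, hedged sketch. The toy data frame below is a hypothetical stand-in for the nof1 data (its values are illustrative only), and the data.table part runs only when that package happens to be installed.

```r
# Toy stand-in for the nof1 data (hypothetical values, for illustration only)
toy <- data.frame(ID = c(1, 1, 1, 2, 2), PhyAct = c(53, 73, 23, 10, 30))

# Base-R data frame: each column must be qualified with the data frame name
mean_base <- mean(toy$PhyAct[toy$ID == 1])
print(mean_base)

# data.table: the first index selects rows, the second is an expression
# evaluated on the selected rows, so bare column names are sufficient
if (requireNamespace("data.table", quietly = TRUE)) {
  toy_dt  <- data.table::as.data.table(toy)
  mean_dt <- toy_dt[ID == 1, mean(PhyAct)]
  print(mean_dt)
}
```

Both expressions compute the same group mean; the data.table form simply avoids repeating the object name inside the brackets.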


16.4.3  Creating  Disk-Based  Data  Frames  with  ff 

The ff (fast-files) package allows us to overcome the RAM limitations of finite
system memory. For example, it helps with operating on datasets with billions of rows.
ff creates objects in ffdf format, which act like a map that points to the location of
the data on a disk. However, this makes ffdf objects inapplicable for most R
functions. The only way to address this problem is to break the huge dataset into
small chunks. After processing a batch of these small chunks, we have to combine
the results to reconstruct the complete output. This strategy is relevant in parallel
computing, which will be discussed in detail in the next section. First, let’s download
one of the large datasets in our datasets archive, UQ_VitalSignsData_Case04.csv.

# install.packages("ff")

library(ff)

# vitalsigns <- read.csv.ffdf(file="UQ_VitalSignsData_Case04.csv", header=T)
vitalsigns <-
read.csv.ffdf(file="https://umich.instructure.com/files/366335/download?download_frd=1", header=T)

As  mentioned  earlier,  we  cannot  apply  functions  directly  on  this  object. 


mean(vitalsigns$Pulse)

## Warning in mean.default(vitalsigns$Pulse): argument is not numeric or
## logical: returning NA
## [1] NA

For basic calculations on such large datasets, we can use another package,
ffbase. It allows operations on ffdf objects using simple tasks like mathematical
operations, query functions, summary statistics, and bigger regression models
using packages like biglm, which will be mentioned later in this chapter.

# install.packages("ffbase")

library(ffbase)
mean(vitalsigns$Pulse)

## [1] 108.7185
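The chunk-and-combine strategy mentioned earlier in this section can be sketched in base R. This is only an illustration of the idea, assuming an ordinary in-memory vector as a stand-in for a disk-based column; chunked_mean() and the simulated pulse vector are hypothetical and are not part of ff or ffbase.

```r
# Chunk-and-combine mean: process one block of values at a time and keep only
# running totals, then combine the partial results at the end.
# chunked_mean() is a hypothetical helper, not an ff/ffbase function.
chunked_mean <- function(x, chunk_size = 1000) {
  stopifnot(length(x) > 0, chunk_size >= 1)
  total <- 0
  count <- 0
  starts <- seq(1, length(x), by = chunk_size)
  for (s in starts) {
    idx   <- s:min(s + chunk_size - 1, length(x))
    chunk <- x[idx]                        # with ff, this block would come from disk
    total <- total + sum(chunk, na.rm = TRUE)
    count <- count + sum(!is.na(chunk))
  }
  total / count
}

set.seed(1)
pulse  <- rnorm(10500, mean = 100, sd = 15)   # simulated stand-in for a Pulse column
result <- chunked_mean(pulse, chunk_size = 1000)
print(all.equal(result, mean(pulse)))
```

Only the running sum and count are kept between chunks, so the memory needed does not grow with the number of rows processed.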


16.4.4  Using  Massive  Matrices  with  bigmemory 

The previously introduced packages provide alternatives to data.frame objects. In a
similar spirit, the bigmemory package creates alternative objects to 2D matrices
(second-order tensors). It can store huge datasets, which can be divided into small
chunks that can be converted to data frames. However, we cannot directly apply
machine-learning methods on this type of object. More detailed information about
the bigmemory package is available online.


16.5  Parallel  Computing 

In  previous  chapters,  we  saw  various  machine-learning  techniques  applied  as  serial 
computing  tasks.  The  traditional  protocol  involves:  First,  applying  function  1  to  our 
raw  data.  Then,  using  the  output  from  function  1  as  an  input  to  function  2.  This 
process  may  be  iterated  over  a  series  of  functions.  Finally,  we  have  the  terminal 
output  generated  by  the  last  function.  This  serial  or  linear  computing  method  is 
straightforward  but  time  consuming  and  perhaps  sub-optimal. 

Now we introduce a more efficient way of computing, parallel computing, which
provides a mechanism to deal with different tasks at the same time and combine the
outputs of all processes to get the final answer faster. However, parallel algorithms
may require special conditions and cannot be applied to all problems. If two
tasks have to be run in a specific order, they cannot be run in parallel.


16.5.1  Measuring  Execution  Time 

To measure how much time can be saved by different methods, we can use the function
system.time().

system.time(mean(vitalsigns$Pulse))

##    user  system elapsed
##       0       0       0

This means calculating the mean of the Pulse column in the vitalsigns dataset
takes less than 0.001 seconds, which rounds to zero in the report. These values will
vary between computers, operating systems, and states of operations.
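As a small, self-contained illustration of timing two alternative implementations, the sketch below compares an explicit loop with a vectorized call. The function slow_sum() is a hypothetical example written for this comparison, and the actual timings will differ from machine to machine.

```r
# system.time() returns a "proc_time" object; its "elapsed" entry is wall-clock time.
# slow_sum() is a hypothetical helper that adds the integers 1..n in an explicit loop.
slow_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

t_loop <- system.time(r_loop <- slow_sum(1e6))
t_vec  <- system.time(r_vec  <- sum(as.numeric(seq_len(1e6))))

print(t_loop[["elapsed"]])   # time of the loop version, in seconds
print(t_vec[["elapsed"]])    # time of the vectorized version, in seconds
stopifnot(r_loop == r_vec)   # both approaches compute the same value
```

Assigning the result inside the system.time() call, as done above and in the chapter examples, keeps the computed value available after the timing completes.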


16.5.2  Parallel  Processing  with  Multiple  Cores 

We will introduce two packages for parallel computing, multicore and snow
(their core components are included in the package parallel). They take different
approaches to multitasking. However, to run these packages, you need to have a
relatively modern multicore computer. Let’s check how many cores your computer
has. The function parallel::detectCores() provides this functionality.
parallel is a base package, so there is no need to install it prior to using it.

library(parallel); detectCores()

## [1] 8

So,  there  are  eight  (8)  cores  in  my  computer.  I  will  be  able  to  run  up  to  6-8  parallel 
jobs  on  this  computer. 

The multicore package simply uses the multitasking capabilities of the kernel,
the core of the computer’s operating system, to “fork” additional R sessions that share
the same memory. Imagine that we open several R sessions in parallel and let each of
them do part of the work. Now, let’s examine how this can save time when running
complex protocols or dealing with large datasets. To start with, we can use the
mclapply() function, which is similar to lapply(), which applies a function to each
element of a vector and returns a list. Instead of processing the elements one after
another, mclapply() divides the complete computational task and delegates portions
of it to each available core. To demonstrate this procedure, we will construct a simple,
yet time consuming, task of generating random numbers. Also, we can use the
system.time() function to track execution time.


set.seed(123)

system.time(c1 <- rnorm(10000000))

##    user  system elapsed
##    0.64    0.00    0.64

# Note the multicore calls may not work on Windows, but will work on Linux/Mac.

# This shows 2-core and 4-core invocations

# system.time(c2 <- unlist(mclapply(1:2, function(x){rnorm(5000000)},
# mc.cores = 2)))

# system.time(c4 <- unlist(mclapply(1:4, function(x){rnorm(2500000)},
# mc.cores = 4)))

# And here is a Windows (single core invocation)

system.time(c2 <- unlist(mclapply(1:2, function(x){rnorm(5000000)},
mc.cores = 1)))

##    user  system elapsed
##    0.65    0.00    0.65

The unlist() call is used at the end to combine results from different cores into a
single vector. Each line of code creates 10,000,000 random numbers. The c1 call
runs serially on a single core and is expected to take the longest time to complete.
With mc.cores = 2, the c2 call uses two cores to finish the task (each core handles
5,000,000 numbers) and should use less time than c1. Finally, with mc.cores = 4, c4
uses four cores and should reduce the overall time further. In general, using more
cores reduces the overall time, although the single-core invocation shown above
(mc.cores = 1) gives essentially the same timing as c1.
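A portable variant of the calls above can be sketched as follows. It assumes that two worker processes are acceptable on non-Windows systems and falls back to a single core on Windows, where forking is not available; the variable names (n_cores, chunk_sizes, pieces, combined) are illustrative choices.

```r
library(parallel)

# Forking is unavailable on Windows, where mclapply() only accepts mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the total amount of work into one equally sized chunk per core
total_n     <- 1e6
chunk_sizes <- rep(total_n / n_cores, n_cores)

# Each list element is generated in its own forked R process (when n_cores > 1)
pieces   <- mclapply(chunk_sizes, function(n) rnorm(n), mc.cores = n_cores)
combined <- unlist(pieces)               # merge the per-core results into one vector
print(length(combined))
```

Note that random numbers generated in forked processes generally differ from those of a serial run with the same seed, so the two results should be compared by size and distribution rather than value by value.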

The snow package allows parallel computing on multicore multiprocessor
machines or a network of multiple machines. It might be more difficult to use, but
it is also more flexible. First, we can set how many cores we want to use via the
makeCluster() function.


# install.packages("snow")

library(snow)

cl <- makeCluster(2)

This call might cause your computer to pop up a message warning about access
through the firewall. To do the same task we can use the parLapply() function in the
snow package. Note that we have to pass the cluster object we created with the previous
makeCluster() call.


system.time(c2 <- unlist(parLapply(cl, c(5000000, 5000000), function(x) {
rnorm(x) })))

##    user  system elapsed
##    0.11    0.11    0.64

When using parLapply(), we have to specify the cluster object, the vector (or list)
of inputs, and the function that will be applied to each input element. Remember to
stop the cluster we made after completing the task, to release the system resources.

stopCluster(cl)
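One way to make sure the cluster is always released, even when the parallel call fails, is to wrap the computation in tryCatch() with a finally clause. The sketch below uses the parallel package, which provides the same makeCluster(), parLapply(), and stopCluster() interface; the squaring task is a hypothetical placeholder for real work.

```r
library(parallel)

cl <- makeCluster(2)                     # start two local worker processes

squares <- tryCatch(
  parLapply(cl, 1:4, function(i) i^2),   # each input element is sent to a worker
  finally = stopCluster(cl)              # always release the workers afterwards
)

result <- unlist(squares)
print(result)
```

The finally expression runs whether or not parLapply() succeeds, so no worker processes are left behind after an error.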


16.5.3 Parallelization Using foreach and doParallel

The foreach package provides another option for parallel computing. It relies on a
loop-like process, basically applying a specified function to each item in a set,
which again is somewhat similar to apply(), lapply(), and other regular
functions. The interesting thing is that these loops can be computed in parallel,
saving substantial amounts of time. The foreach package alone cannot provide
parallel computing. We have to combine it with other packages like doParallel.
Let’s reexamine the task of creating a vector of 10,000,000 random numbers. First,
register the 4 compute cores using registerDoParallel().

# install.packages("doParallel")

library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

Then we can examine the time-saving foreach command.

# install.packages("foreach")

library(foreach)

system.time(c4 <- foreach(i=1:4, .combine = 'c')
%dopar% rnorm(2500000))

##    user  system elapsed
##    0.11    0.18    0.54

Here we used four items (each item runs on a separate core). The argument
.combine = 'c' tells foreach to combine the results using the function c(),
generating the aggregate result vector.

Also, don’t forget to close the doParallel backend by registering the sequential
backend.

unregister <- registerDoSEQ()


16.5.4 GPU Computing

Modern computers have graphics cards, GPUs (Graphical Processing Units), that
consist of thousands of cores; however, they are very specialized, unlike the
standard CPU chip. If we can use this feature for parallel computing, we may
reach amazing performance improvements, at the cost of complicating the
processing algorithms and increasing the constraints on the data format. Specific
disadvantages of GPU computing include reliance on proprietary manufacturer (e.g.,
NVidia) frameworks and the Compute Unified Device Architecture (CUDA)
programming language. CUDA allows programming of GPU instructions in a common
computing language. This paper provides one example of using GPU computation to
significantly improve the performance of advanced neuroimaging and brain mapping
processing of multidimensional data.

The R package gputools is created for parallel computing using NVidia
CUDA. Detailed information about GPU computing in R is available online.


16.6  Deploying  Optimized  Learning  Algorithms 

As we mentioned earlier, some tasks can be parallelized more easily than others. In
real-world situations, we can pick the algorithms that lend themselves well to
parallelization. Some of the R packages that allow parallel computing using ML
algorithms are listed below.


16.6.1  Building  Bigger  Regression  Models  with  biglm 

biglm allows training regression models with data from SQL databases or large
data chunks obtained from the ff package. The output is similar to the standard
lm() function that builds linear models. However, biglm operates efficiently on
massive datasets.


16.6.2  Growing  Bigger  and  Faster  Random  Forests 
with  bigrf 

The bigrf package can be used to train random forests combining the foreach
and doParallel packages. In Chap. 15, we presented random forests as machine
learners ensembling multiple tree learners. With parallel computing, we can split the
task of creating thousands of trees into smaller tasks that can be outsourced to each
available compute core. We only need to combine the results at the end. Then, we
will obtain an equivalent forest (up to random-number differences) in a relatively
shorter amount of time.


16.6.3  Training  and  Evaluation  Models  in  Parallel 
with  caret 

Combining the caret package with foreach, we can obtain a powerful method
to deal with time-consuming tasks like building a random forest learner. Utilizing the
same example we presented in Chap. 15, we can see the time difference of utilizing
the foreach package.


# library(caret)

system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))

##    user  system elapsed
##  130.05    0.40  130.49

It took more than two minutes (about 130 seconds) to finish this task in the standard
execution model, purely relying on the regular caret function. Below, this same model
training completes much faster using parallelization (less than half the time)
compared to the standard call above.


set.seed(123)
cl <- makeCluster(4)
registerDoParallel(cl)
getDoParWorkers()

## [1] 4

system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))

##    user  system elapsed
##    4.61    0.02   47.70

unregister <- registerDoSEQ()


16.7  Practice  Problem 

Try to analyze the co-appearance network in the novel “Les Miserables”. The data
contains the weighted network of co-appearances of characters in Victor Hugo’s
novel “Les Miserables”. Nodes represent characters as indicated by the labels and
edges connect any pair of characters that appear in the same chapter of the book. The
values on the edges are the number of such co-appearances.

miserables <- read.table("https://umich.instructure.com/files/330389/download?download_frd=1",
sep="", header=F)
head(miserables)

Also,  try  to  interrogate  some  of  the  larger  datasets  we  have  by  using  alternative 
parallel  computing  and  big  data  analytics. 


16.8  Assignment:  16.  Specialized  Machine  Learning  Topics 

16.8.1  Working  with  Website  Data 

•  Download  the  Main  SOCR  Wiki  Page  and  compare  RCurl  and  httr. 

•  Read  and  write  XML  code  for  the  SOCR  Main  Page. 

•  Scrape  the  data  from  the  SOCR  Main  Page. 


16.8.2  Network  Data  and  Visualization 

•  Download  03_les_miserablese_GraphData.txt 

•  Visualize  this  undirected  network. 

• Summarize the graph and explain the output.

•  Calculate  degree  and  the  centrality  of  this  graph. 

•  Find  out  some  important  characters. 

•  Will  the  result  change  or  not  if  we  assume  the  graph  is  directed? 


16.8.3  Data  Conversion  and  Parallel  Computing 

• Download CaseStudy12_AdultsHeartAttack_Data.xlsx or access it online.

• Load this data as a data frame.

• Use export() or write.xlsx() to regenerate the xlsx file.

• Use the rio package to convert this .xlsx file to .csv.

• Generate a generalized tabular data structure (tbl).

• Generate a data.table.

• Create disk-based data frames and perform basic calculations.

• Perform basic calculations on the last 5 columns as a big matrix.

• Use DIAGNOSIS, SEX, DRG, CHARGES, LOS and AGE to predict DIED with
randomForest, setting ntree=20000. Notice: sample without replacement to
get as large a balanced dataset as possible.

• Run train() in caret and measure the execution time.

• Detect the number of cores and make a cluster of a proper size.

• Rerun train() parallelized and compare the execution time.

• Use foreach and doMC to design a parallelized random forest with
ntree=20000 in total and compare the execution time with sequential execution.

Chapter 17

Variable/Feature Selection


As  we  mentioned  in  Chap.  16,  variable  selection  is  very  important  when  dealing  with 
bioinformatics,  healthcare,  and  biomedical  data,  where  we  may  have  more  features 
than  observations.  Variable  selection,  or  feature  selection,  can  help  us  focus  only  on 
the  core  important  information  contained  in  the  observations,  instead  of  every  piece 
of  information.  Due  to  presence  of  intrinsic  and  extrinsic  noise,  the  volume  and 
complexity  of  big  health  data,  and  different  methodological  and  technological 
challenges,  this  process  of  identifying  the  salient  features  may  resemble  finding  a 
needle  in  a  haystack.  Here,  we  will  illustrate  alternative  strategies  for  feature 
selection  using  filtering  (e.g.,  correlation-based  feature  selection),  wrapping  (e.g., 
recursive  feature  elimination),  and  embedding  (e.g.,  variable  importance  via  random 
forest  classification)  techniques. 

The next chapter, Chap. 18, provides the details about another powerful technique for
variable selection using decoy features to control the false discovery rate of
choosing inconsequential features.


17.1  Feature  Selection  Methods 

There  are  three  major  classes  of  variable  or  feature  selection  techniques — filtering- 
based,  wrapper-based,  and  embedded  methods. 


17.1.1  Filtering  Techniques 

• Univariate: Univariate filtering methods focus on selecting single features with
high scores based on some statistics like χ² or Information Gain Ratio. Each
feature is viewed as independent of the others, effectively ignoring interactions
between features.

• Examples: χ², Euclidean distance, t-test, and Information gain.

• Multivariate: Multivariate filtering methods rely on various (multivariate)
statistics to select the principal features. They typically account for between-feature
interactions by using higher-order statistics like correlation. The basic idea is that
we iteratively triage variables that have high correlations with other features.

• Examples: Correlation-based feature selection, Markov blanket filter, and fast
correlation-based feature selection.
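The multivariate filtering idea, triaging variables that are highly correlated with other features, can be sketched in base R. This is a simple greedy filter written for illustration, not the correlation-based feature selection algorithm named above; correlation_filter(), the cutoff of 0.9, and the simulated features are all assumptions.

```r
# Greedy correlation filter: walk through the features in order and keep a feature
# only if its absolute correlation with every already-kept feature is below a cutoff.
# correlation_filter() is a hypothetical helper used for illustration.
correlation_filter <- function(X, cutoff = 0.9) {
  R <- abs(cor(X))
  keep <- character(0)
  for (v in colnames(X)) {
    if (length(keep) == 0 || all(R[v, keep] < cutoff)) keep <- c(keep, v)
  }
  keep
}

set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.01)   # nearly a copy of x1 (redundant feature)
x3 <- rnorm(200)                   # generated independently of x1
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

selected <- correlation_filter(X, cutoff = 0.9)
print(selected)
```

Because the filter is greedy, the result depends on the column order: the first member of each highly correlated group is the one that is retained.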


17.1.2  Wrapper  Methods 

•  Deterministic :  Deterministic  wrapper  feature  selection  methods  either  start  with 
no  features  (forward-selection)  or  with  all  features  included  in  the  model 
(backward-selection) and iteratively refine the set of chosen features according
to  some  model  quality  measures.  The  iterative  process  of  adding  or  removing 
features  may  rely  on  statistics  like  the  Jaccard  similarity  coefficient. 

•  Examples:  Sequential  forward  selection,  Recursive  Feature  Elimination,  Plus 
q  take-away  r,  and  Beam  search. 

•  Randomized :  Stochastic  wrapper  feature  selection  procedures  utilize  a  binary 
feature-indexing  vector  indicating  whether  or  not  each  variable  should  be 
included  in  the  list  of  salient  features.  At  each  iteration,  we  randomly  perturb 
the  binary  indicators  vector  and  compare  the  combinations  of  features  before  and 
after  the  random  inclusion-exclusion  indexing  change.  Finally,  we  pick  the 
indexing  vector  corresponding  with  the  optimal  performance  based  on  some 
metric,  like  acceptance  probability  measures.  The  iterative  process  continues 
until  no  improvement  of  the  objective  function  is  observed. 

•  Examples:  Simulated  annealing,  genetic  algorithms,  distribution-  and  kernel- 
estimation  algorithms. 
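As a minimal sketch of a deterministic wrapper, base R's step() function can perform forward selection around a linear model. The choice of AIC as the model quality measure and of the built-in mtcars data is an assumption made only for this illustration; the text above does not prescribe either.

```r
# Forward selection with base R's step(): start from the intercept-only model and
# add, one at a time, the predictor that lowers AIC the most, stopping when no
# addition improves AIC. The mtcars data and the AIC criterion are illustrative choices.
null_model <- lm(mpg ~ 1, data = mtcars)
full_scope <- formula(lm(mpg ~ ., data = mtcars))   # upper limit: all predictors

forward_fit <- step(null_model, scope = full_scope,
                    direction = "forward", trace = 0)

chosen <- attr(terms(forward_fit), "term.labels")   # features kept by the wrapper
print(chosen)
```

Setting direction = "backward" and starting from the full model would give the corresponding backward-selection variant.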


17.1.3  Embedded  Techniques 

• Embedded-feature selection techniques are based on various classifiers,
predictors, or clustering procedures. For instance, we can accomplish feature selection
by using decision trees where the separation of the training data relies on features
associated with the highest information gain. Further tree branching, separating
the data deeper, may utilize weaker features. This process of choosing the vital
features based on their separability characteristics continues until the classifier
generates group labels that are mostly homogeneous within clusters/classes and
largely heterogeneous across groups, and when the information gain of further
tree branching is marginal. The entire process may be iterated multiple times to
select the features that appear most frequently.

• Examples: Decision trees, random forests, weighted naive Bayes, and feature
selection using weighted-SVM.

The  different  types  of  feature  selection  methods  have  their  own  pros  and  cons.  In  this 
chapter,  we  are  going  to  introduce  the  randomized  wrapper  method  using  the 
Boruta  package,  which  utilizes  the  random  forest  classification  method  to  output 
variable  importance  measures  (VIMs).  Then,  we  will  compare  its  results  with 
Recursive  Feature  Elimination,  a  classical  deterministic  wrapper  method. 


17.2  Case  Study:  ALS 
17.2.1  Step  1:  Collecting  Data 

First things first, let’s explore the dataset we will be using. Case Study
15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries,
associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis
(ALS), also known as Lou Gehrig’s disease. This ALS case-study reflects a large
clinical trial including big, multi-source and heterogeneous datasets. It would be
interesting to interrogate the data and attempt to derive potential biomarkers that can
be used for detecting, prognosticating, and forecasting the progression of this
neurodegenerative disorder. Overcoming many scientific, technical and infrastructure
barriers is required to establish complete, efficient, and reproducible protocols
for such complex data. These pipeline workflows start with ingesting the raw data,
preprocessing, aggregating, harmonizing, analyzing, visualizing and interpreting the
findings.

In this case-study, we use the training dataset that contains 2223 observations and
131 numeric variables. We select ALSFRS slope as our outcome variable, as it
captures the patients’ clinical decline over a year. Although we have more
observations than features, this is one of the examples where multiple features are highly
correlated. Therefore, we need to preprocess the variables, e.g., apply feature
selection, before commencing with predictive analytics.


17.2.2 Step 2: Exploring and Preparing the Data

The dataset is located in our case-studies archive. We can use read.csv() to
directly import the CSV dataset into R using the URL reference.

ALS.train <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)


##        ID            Age_mean      Albumin_max    Albumin_median
##  Min.   :   1.0   Min.   :18.00   Min.   :37.00   Min.   :34.50
##  1st Qu.: 614.5   1st Qu.:47.00   1st Qu.:45.00   1st Qu.:42.00
##  Median :1213.0   Median :55.00   Median :47.00   Median :44.00
##  Mean   :1214.9   Mean   :54.55   Mean   :47.01   Mean   :43.95
##  3rd Qu.:1815.5   3rd Qu.:63.00   3rd Qu.:49.00   3rd Qu.:46.00
##  Max.   :2424.0   Max.   :81.00   Max.   :70.30   Max.   :51.10

##  Urine.Ph_median   Urine.Ph_min
##  Min.   :5.000    Min.   :5.000
##  1st Qu.:5.000    1st Qu.:5.000
##  Median :6.000    Median :5.000
##  Mean   :5.711    Mean   :5.183
##  3rd Qu.:6.000    3rd Qu.:5.000
##  Max.   :9.000    Max.   :8.000

There are 131 features and some of the variables represent statistics like max, min and
median values of the same clinical measurements.


17.2.3  Step  3:  Training  a  Model  on  the  Data 

Now let’s explore the Boruta() function in the Boruta package to perform
variable selection, based on random forest classification. Boruta() includes the
following components:

vs <- Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj =
TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)


•  class:  variable  for  class  labels. 

•  features:  potential  features  to  select  from. 

•  data:  dataset  containing  classes  and  features. 

• pValue: confidence level. Default value is 0.01 (notice we are applying multiple
variable selection).

• mcAdj: Default TRUE to apply a multiple comparisons adjustment using the
Bonferroni method.

• maxRuns: maximal number of importance source runs. You may increase it to
resolve attributes left Tentative.

• doTrace: verbosity level. The default, 0, means no tracing; 1 means reporting
the decision about each attribute as soon as it is justified; 2 means the same as 1,
plus reporting the number of attributes at each importance source run.

• getImp: function used to obtain attribute importance. The default is getImpRfZ,
which runs random forest, from the ranger package, and gathers Z-scores of the
mean decrease accuracy measure.

The  resulting  vs  object  is  of  class  Boruta  and  contains  two  important 
components: 

• finalDecision: a factor of three values: Confirmed, Rejected or Tentative,
containing the final results of the feature selection process.

• ImpHistory: a data frame of importance of attributes gathered in each importance
source run. Besides the predictors’ importance, it contains maximal, mean
and minimal importance of shadow attributes for each run. Rejected attributes
get -Inf importance. This output is set to NULL if we specify
holdHistory=FALSE in the Boruta call.
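The role of the shadow attributes can be illustrated with a deliberately simplified base-R sketch. It is not the Boruta algorithm: absolute correlation with the outcome is used here as a stand-in importance score (an assumption, in place of random forest Z-scores), only a single run is performed, and all variable names and simulated data are hypothetical.

```r
set.seed(2024)
n      <- 300
signal <- rnorm(n)                      # feature that truly drives the outcome
noise  <- rnorm(n)                      # feature unrelated to the outcome
y      <- 2 * signal + rnorm(n)
X      <- data.frame(signal = signal, noise = noise)

# Shadow attributes: each column is shuffled independently, which breaks any
# association with y while preserving the column's marginal distribution
shadows <- as.data.frame(lapply(X, sample))
names(shadows) <- paste0("shadow_", names(X))

# Stand-in importance score (assumption): absolute correlation with the outcome
importance <- function(df) sapply(df, function(col) abs(cor(col, y)))
real_imp   <- importance(X)
shadow_max <- max(importance(shadows))

# A real feature looks promising only if it beats the best shadow attribute
beats_shadow <- real_imp > shadow_max
print(beats_shadow)
```

The actual procedure repeats this kind of comparison over many importance source runs and applies a statistical test before marking an attribute as Confirmed or Rejected.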


Note:  Running  the  code  below  will  take  several  minutes. 


# install.packages("Boruta")

library(Boruta)
set.seed(123)

als <- Boruta(ALSFRS_slope~.-ID, data=ALS.train, doTrace=0)
print(als)

## Boruta performed 99 iterations in 4.683657 mins.
##  28 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 23 more;
##  59 attributes confirmed unimportant: Albumin_max, Albumin_median,
## Albumin_min, ALT.SGPT._max, ALT.SGPT._median and 54 more;
##  12 tentative attributes left: Age_mean, Albumin_range,
## Creatinine_max, Hematocrit_median, Hematocrit_range and 7 more;

als$ImpHistory[1:6, 1:10]

##        Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,]  1.2031427   1.4969268      0.6976378   0.9385041      1.979510
## [2,] -0.1998469   0.7204092     -1.5626360   0.5777092      2.573882
## [3,]  1.9272058  -1.0274668      0.2216170  -1.2234402      1.843967
## [4,]  0.5763244   0.9097371      0.2960979   0.6137624      2.184383
## [5,]  3.3655147   1.9412326      0.3849548   1.7309793      1.134676
## [6,]  0.2603118  -0.0287943      1.4164860   2.3251879      2.259974
##      ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min
## [1,]         6.925233            9.551064         15.92924
## [2,]         8.124101            7.867399         14.94650
## [3,]         7.443326            8.735702         17.26469
## [4,]         7.578267            7.868885         16.95563
## [5,]         7.554582            7.248834         15.42697
## [6,]         7.516362            7.145460         14.94824
##      ALSFRS_Total_range ALT.SGPT._max
## [1,]           25.78135     4.1516252
## [2,]           26.11722     1.2187027
## [3,]           25.61523     2.1618804
## [4,]           28.19229     0.4305607
## [5,]           24.90620     1.2043325
## [6,]           26.57093     0.8463782
This is a fairly time-consuming computation. Boruta separates the important
attributes from the unimportant and tentative features. Here the importance is based
on the out-of-bag (OOB) error. The OOB error estimates the prediction error of
machine-learning methods (e.g., random forests and boosted decision trees) that utilize
bootstrap aggregation to sub-sample the training data. It is the mean prediction
error on each training sample $x_i$, using only the trees that did not include $x_i$ in
their bootstrap samples. Out-of-bag estimates provide an internal assessment of the
learning accuracy and avoid the need for an independent external validation dataset.
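To make the out-of-bag idea concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions: the built-in mtcars data and simple linear base learners stand in for the ALS data and the random-forest trees that Boruta uses, and the names pred_sum, pred_cnt, and oob_mse are hypothetical.

```r
# Minimal sketch of an out-of-bag (OOB) error estimate for a bagged regression.
# Assumption: mtcars and lm() base learners replace the ALS data and forest trees.
set.seed(123)
n <- nrow(mtcars)
B <- 200                                   # number of bootstrap replicates
pred_sum <- numeric(n)                     # running sum of OOB predictions per case
pred_cnt <- numeric(n)                     # how often each case was out-of-bag

for (b in seq_len(B)) {
  in_bag <- sample(n, n, replace = TRUE)   # bootstrap sample of row indices
  oob    <- setdiff(seq_len(n), in_bag)    # cases not drawn in this replicate
  fit    <- lm(mpg ~ wt + hp, data = mtcars[in_bag, ])
  pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = mtcars[oob, ])
  pred_cnt[oob] <- pred_cnt[oob] + 1
}

# Average only the predictions of models that never saw the case
oob_pred <- pred_sum / pred_cnt
oob_mse  <- mean((mtcars$mpg - oob_pred)^2, na.rm = TRUE)
oob_mse
```

Because every prediction comes from models fit without the corresponding case, oob_mse behaves like a built-in validation error and no separate hold-out set is needed.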

The importance scores for all features at every iteration are stored in the data
frame als$ImpHistory. Let's plot a graph depicting the essential features.

Note: Again, running this code will take several minutes to complete (Fig. 17.1).

plot(als, xlab = "", xaxt = "n")
lz <- lapply(1:ncol(als$ImpHistory), function(i)
    als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
names(lz) <- colnames(als$ImpHistory)
lb <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(lb), at = 1:ncol(als$ImpHistory),
     cex.axis = 0.5, font = 4)


Fig. 17.1 Ranked variable importance using box-and-whisker plots for each feature
We can see that plotting the graph is easy but extracting matched feature names
may require more work. The basic plot is done by the call plot(als,
xlab = "", xaxt = "n"), where xaxt = "n" suppresses plotting of the
x-axis. The following lines in the script reconstruct the x-axis labels. lz is a list created
by the lapply() function. Each element in lz contains all the importance scores for
a single feature in the original dataset, excluding the infinite (-Inf) values recorded
after a feature is rejected. Then, we sort the features according to their
median importance and print their names on the x-axis by using axis().
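The label-ordering step can be checked on a tiny example. The sketch below uses a made-up importance matrix (imp, with hypothetical features featA, featB, featC) rather than the real als$ImpHistory, so it runs without fitting Boruta.

```r
# Toy importance history: rows are runs, columns are features.
# -Inf marks runs after a feature was rejected (as in Boruta's ImpHistory).
imp <- cbind(featA = c(5.1, 4.8, 5.3, 5.0),
             featB = c(0.4, 0.1, -Inf, -Inf),
             featC = c(2.2, 2.9, 2.5, 2.7))

# Keep only the finite scores of each column
lz <- lapply(seq_len(ncol(imp)), function(i) imp[is.finite(imp[, i]), i])
names(lz) <- colnames(imp)

# Order features by median importance (ascending), as used for the axis labels
lb <- sort(sapply(lz, median))
names(lb)
## [1] "featB" "featC" "featA"
```

The ascending order of the medians (0.25, 2.6, 5.05) matches the left-to-right order of the boxes, which is why names(lb) can be used directly as the axis labels.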

We  have  already  seen  similar  groups  of  boxplots  back  in  Chaps.  3  and  4.  In  this 
graph,  variables  with  green  boxes  are  more  important  than  the  ones  represented  with 
red  boxes,  and  we  can  see  the  range  of  importance  scores  within  a  single  variable  in 
the  graph. 

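It may be desirable to get rid of tentative features. The TentativeRoughFix()
function does this by comparing the median importance of each tentative feature with
the median importance of the best shadow attribute. Notice that this function should
be used only when a strict decision is highly desired, because this test is much weaker
than Boruta and can lower the confidence of the final result.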

final.als <- TentativeRoughFix(als)
print(final.als)

## Boruta performed 99 iterations in 4.683657 mins.
## Tentatives roughfixed over the last 99 iterations.
## 32 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 27 more;
## 67 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 62 more;

final.als$finalDecision

##            Age_mean         Albumin_max
##            Rejected            Rejected
##      Albumin_median         Albumin_min
##            Rejected            Rejected
##       Albumin_range    ALSFRS_Total_max
##            Rejected           Confirmed
## ALSFRS_Total_median    ALSFRS_Total_min
##           Confirmed           Confirmed
## ...
##        Urine.Ph_max     Urine.Ph_median
##            Rejected            Rejected
##        Urine.Ph_min
##            Rejected
## Levels: Tentative Confirmed Rejected

getConfirmedFormula(final.als)

## ALSFRS_slope ~ ALSFRS_Total_max + ALSFRS_Total_median + ALSFRS_Total_min +
##     ALSFRS_Total_range + Creatinine_median + Creatinine_min +
##     hands_max + hands_median + hands_min + hands_range + Hematocrit_max +
##     Hematocrit_min + Hematocrit_range + Hemoglobin_median + Hemoglobin_range +
##     leg_max + leg_median + leg_min + leg_range + mouth_max +
##     mouth_median + mouth_min + mouth_range + onset_delta_mean +
##     pulse_max + respiratory_median + respiratory_min + respiratory_range +
##     trunk_max + trunk_median + trunk_min + trunk_range
## <environment: 0x000000000989d6f8>
# report the Boruta "Confirmed" & "Tentative" features, removing the "Rejected" ones
print(final.als$finalDecision[final.als$finalDecision %in% c("Confirmed", "Tentative")])

##    ALSFRS_Total_max ALSFRS_Total_median    ALSFRS_Total_min
##           Confirmed           Confirmed           Confirmed
##  ALSFRS_Total_range   Creatinine_median      Creatinine_min
##           Confirmed           Confirmed           Confirmed
##           hands_max        hands_median           hands_min
##           Confirmed           Confirmed           Confirmed
##         hands_range      Hematocrit_max      Hematocrit_min
##           Confirmed           Confirmed           Confirmed
##    Hematocrit_range   Hemoglobin_median    Hemoglobin_range
##           Confirmed           Confirmed           Confirmed
##             leg_max          leg_median             leg_min
##           Confirmed           Confirmed           Confirmed
##           leg_range           mouth_max        mouth_median
##           Confirmed           Confirmed           Confirmed
##           mouth_min         mouth_range    onset_delta_mean
##           Confirmed           Confirmed           Confirmed
##           pulse_max  respiratory_median     respiratory_min
##           Confirmed           Confirmed           Confirmed
##   respiratory_range           trunk_max        trunk_median
##           Confirmed           Confirmed           Confirmed
##           trunk_min         trunk_range
##           Confirmed           Confirmed
## Levels: Tentative Confirmed Rejected

# how many are actually "confirmed" as important/salient?
impBoruta <- final.als$finalDecision[final.als$finalDecision %in% c("Confirmed")]
length(impBoruta)

## [1] 32

The report above shows the final feature selection. It retains only the "Confirmed"
features because, after the rough fix, no "Tentative" features remain.


17.2.4  Step  4:  Evaluating  Model  Performance 

Comparing  with  RFE 

Let’s  compare  the  Boruta  results  against  a  classical  variable  selection  method — 
recursive  feature  elimination  (RFE).  First,  we  need  to  load  two  packages:  caret 
and  randomForest.  Then,  as  we  did  in  Chap.  15,  we  must  specify  a  resampling 
method.  Here  we  use  10-fold  CV  to  do  the  resampling. 

library(caret)
library(randomForest)
set.seed(123)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

Now,  all  preparations  are  complete  and  we  are  ready  to  do  the  RFE  variable 
selection. 
rf.train <- rfe(ALS.train[, -c(1, 7)], ALS.train[, 7], sizes = c(10, 20, 30, 40),
                rfeControl = control)
rf.train

## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
##  Variables   RMSE Rsquared  RMSESD RsquaredSD Selected
##         10 0.3500   0.6837 0.03451    0.03837
##         20 0.3471   0.6894 0.03230    0.03374
##         30 0.3468   0.6900 0.03135    0.02967        *
##         40 0.3473   0.6895 0.03061    0.02887
##         99 0.3503   0.6842 0.02995    0.02868
##
## The top 5 variables (out of 30):
##    ALSFRS_Total_range, trunk_range, hands_range, mouth_range, ALSFRS_Total_min

This calculation may take a long time to complete. The RFE invocation is
different from Boruta. Here we have to specify the feature data frame and the
outcome vector separately. Also, the sizes= option allows us to specify the numbers of
features we want to evaluate in the model. Let's try sizes=c(10, 20, 30, 40)
to compare the model performance for alternative numbers of features.

To  visualize  the  results,  we  can  plot  the  RMSE  error  for  the  five  different  feature 
size  combinations  listed  in  the  summary.  The  one  with  30  features  has  the  lowest 
RMSE  value.  This  result  is  similar  to  the  Boruta  output,  which  selected  around 
30  features  (Fig.  17.2). 


plot(rf.train, type = c("g", "o"), cex = 1, col = 1:5)


Fig.  17.2  Root-mean  square  cross-validation  error  rate  for  random  forest  classification  of  the  ALS 
study  against  the  number  of  features 
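The choice of subset size can also be read off numerically. The sketch below copies the RMSE and RMSESD columns from the resampling summary above into a small data frame (res, a hypothetical name) and picks the size with the smallest RMSE. As an optional extra that is not part of the original analysis, it also applies the common one-standard-error rule, assuming the standard error is RMSESD divided by the square root of the 10 folds.

```r
# Values copied from the rfe() resampling summary shown above
res <- data.frame(
  Variables = c(10, 20, 30, 40, 99),
  RMSE      = c(0.3500, 0.3471, 0.3468, 0.3473, 0.3503),
  RMSESD    = c(0.03451, 0.03230, 0.03135, 0.03061, 0.02995)
)

# Subset size with the smallest cross-validated RMSE
best_row  <- which.min(res$RMSE)
best_size <- res$Variables[best_row]

# One-standard-error rule (an assumption, not used in the text):
# smallest size whose RMSE is within one SE of the best RMSE
se_best     <- res$RMSESD[best_row] / sqrt(10)
one_se_size <- min(res$Variables[res$RMSE <= res$RMSE[best_row] + se_best])

c(best = best_size, one_se = one_se_size)
## best one_se
##   30     10
```

The RMSE differences between sizes are small relative to their resampling variability, so a more parsimonious model with 10 features would perform almost as well as the selected 30-feature model.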
Using the functions predictors() and getSelectedAttributes(),
we can compare the final results of the two alternative feature selection methods.

predRFE <- predictors(rf.train)
predBoruta <- getSelectedAttributes(final.als, withTentative = F)

The Boruta and RFE feature-selection results are almost identical.

intersect(predBoruta, predRFE)

##  [1] "ALSFRS_Total_max"   "ALSFRS_Total_median" "ALSFRS_Total_min"
##  [4] "ALSFRS_Total_range" "Creatinine_min"      "hands_max"
##  [7] "hands_median"       "hands_min"           "hands_range"
## [10] "Hematocrit_max"     "Hemoglobin_median"   "leg_max"
## [13] "leg_median"         "leg_min"             "leg_range"
## [16] "mouth_median"       "mouth_min"           "mouth_range"
## [19] "onset_delta_mean"   "respiratory_median"  "respiratory_min"
## [22] "respiratory_range"  "trunk_max"           "trunk_median"
## [25] "trunk_min"          "trunk_range"


There are 26 common variables chosen by the two techniques, which suggests
that both the Boruta and RFE methods are robust. Also, notice that the Boruta
method gives similar results without requiring the sizes option. If we want to
consider ten or more different sizes, the RFE procedure becomes quite time consuming.
Thus, the Boruta method is effective when dealing with complex real-world problems.
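Beyond counting the common variables, it can help to see which features each method selects alone and to summarize the agreement in one number. The following sketch uses two made-up selections (selA and selB are hypothetical stand-ins for predBoruta and predRFE) and the Jaccard index, which is an added summary rather than something computed in the original analysis.

```r
# Hypothetical selections from two methods (names are illustrative only)
selA <- c("x1", "x2", "x3", "x4", "x5")
selB <- c("x2", "x3", "x5", "x6")

common  <- intersect(selA, selB)            # chosen by both methods
onlyA   <- setdiff(selA, selB)              # chosen only by method A
onlyB   <- setdiff(selB, selA)              # chosen only by method B

# Jaccard index: size of the intersection over size of the union
jaccard <- length(common) / length(union(selA, selB))

list(common = common, onlyA = onlyA, onlyB = onlyB, jaccard = jaccard)
```

Here three of the six distinct features are shared, so the Jaccard index is 0.5. Replacing selA and selB with the two real selections gives the same kind of summary for Boruta versus RFE.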


Comparing  with  Stepwise  Feature  Selection 

Next,  we  can  contrast  the  Boruta  feature  selection  results  against  another  classical 
variable  selection  method  -  stepwise  model  selection.  Let’s  start  with  fitting  a 
bidirectional  stepwise  linear  model-based  feature  selection. 

data2 <- ALS.train[, -1]

# Define a base model - intercept only
base.mod <- lm(ALSFRS_slope ~ 1, data = data2)

# Define the full model - including all predictors
all.mod <- lm(ALSFRS_slope ~ ., data = data2)

# ols_step <- lm(ALSFRS_slope ~ ., data = data2)
ols_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod),
                 direction = 'both', k = 2, trace = F)
summary(ols_step); ols_step

## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median +
##     ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min +
##     onset_delta_mean + Calcium_min + Albumin_range + Glucose_range +
##     ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min +
##     Creatinine_range + Potassium_range + Chloride_range + Chloride_min +
##     Sodium_median + respiratory_min + respiratory_range + respiratory_max +
##     trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range +
##     Chloride_max + onset_site_mean + trunk_max + Gender_mean +
##     Creatinine_min, data = data2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.22558 -0.17875 -0.02024  0.17098  1.95100
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          4.176e-01  6.064e-01   0.689 0.491091
## ALSFRS_Total_range  -2.260e+01  1.359e+00 -16.631  < 2e-16 ***
## ALSFRS_Total_median -3.388e-02  2.868e-03 -11.812  < 2e-16 ***
## ALSFRS_Total_min     2.821e-02  3.310e-03   8.524  < 2e-16 ***
## ...
## trunk_max            2.288e-02  8.453e-03   2.706 0.006854 **
## Gender_mean         -3.360e-02  1.751e-02  -1.919 0.055066 .
## Creatinine_min       7.643e-04  4.977e-04   1.536 0.124771
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3355 on 2191 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.7094
## F-statistic:   176 on 31 and 2191 DF,  p-value: < 2.2e-16
##
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median +
##     ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min +
##     onset_delta_mean + Calcium_min + Albumin_range + Glucose_range +
##     ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min +
##     Creatinine_range + Potassium_range + Chloride_range + Chloride_min +
##     Sodium_median + respiratory_min + respiratory_range + respiratory_max +
##     trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range +
##     Chloride_max + onset_site_mean + trunk_max + Gender_mean +
##     Creatinine_min, data = data2)
##
## Coefficients:
##         (Intercept)   ALSFRS_Total_range  ALSFRS_Total_median
##           4.176e-01           -2.260e+01           -3.388e-02
##    ALSFRS_Total_min        Calcium_range          Calcium_max
##           2.821e-02            2.410e+02           -4.258e-01
##    bp_diastolic_min     onset_delta_mean          Calcium_min
##          -2.249e-03           -5.461e-05            3.579e-01
##       Albumin_range        Glucose_range     ALT.SGPT._median
##          -2.305e+00           -1.510e+01           -2.300e-03
##    AST.SGOT._median          Glucose_max          Glucose_min
##           3.369e-03            3.279e-02           -3.507e-02
##    Creatinine_range      Potassium_range       Chloride_range
##           5.076e-01           -4.535e+00            5.318e+00
##        Chloride_min        Sodium_median      respiratory_min
##           1.672e-02           -9.830e-03           -1.453e-01
##   respiratory_range      respiratory_max          trunk_range
##          -5.834e+01            1.712e-01           -8.705e+00
##         pulse_range      Bicarbonate_max    Bicarbonate_range
##          -5.117e-01            7.526e-03           -2.204e+00
##        Chloride_max      onset_site_mean            trunk_max
##          -6.918e-03            3.359e-02            2.288e-02
##         Gender_mean       Creatinine_min
##          -3.360e-02            7.643e-04
We can report the stepwise "Confirmed" (salient) features:

# get the shortlisted variables
stepwiseConfirmedVars <- names(unlist(ols_step[[1]]))

# remove the intercept
stepwiseConfirmedVars <- stepwiseConfirmedVars[!stepwiseConfirmedVars %in% "(Intercept)"]
print(stepwiseConfirmedVars)

##  [1] "ALSFRS_Total_range" "ALSFRS_Total_median" "ALSFRS_Total_min"
##  [4] "Calcium_range"      "Calcium_max"         "bp_diastolic_min"
##  [7] "onset_delta_mean"   "Calcium_min"         "Albumin_range"
## [10] "Glucose_range"      "ALT.SGPT._median"    "AST.SGOT._median"
## [13] "Glucose_max"        "Glucose_min"         "Creatinine_range"
## [16] "Potassium_range"    "Chloride_range"      "Chloride_min"
## [19] "Sodium_median"      "respiratory_min"     "respiratory_range"
## [22] "respiratory_max"    "trunk_range"         "pulse_range"
## [25] "Bicarbonate_max"    "Bicarbonate_range"   "Chloride_max"
## [28] "onset_site_mean"    "trunk_max"           "Gender_mean"
## [31] "Creatinine_min"

Next, we rank the stepwise-selected features by their importance and compare them
with the Boruta selection.


library(mlbench)
library(caret)

# estimate variable importance
predStepwise <- varImp(ols_step, scale = FALSE)

# summarize importance
print(predStepwise)

##                       Overall
## ALSFRS_Total_range  16.630592
## ALSFRS_Total_median 11.812263
## ALSFRS_Total_min     8.523606
## Calcium_range        5.754045
## Calcium_max          4.812942
## bp_diastolic_min     2.539766
## onset_delta_mean     2.758465
## Calcium_min          3.767450
## Albumin_range        2.812018
## Glucose_range        5.156259
## ALT.SGPT._median     2.876338
## AST.SGOT._median     2.641369
## Glucose_max          4.629759
## Glucose_min          4.022642
## Creatinine_range     2.293301
## Potassium_range      1.739268
## Chloride_range       4.474709
## Chloride_min         4.403551
## Sodium_median        2.118710
## respiratory_min      5.948488
## respiratory_range    5.756735
## respiratory_max      5.041816
## trunk_range          2.819029
## pulse_range          1.696811
## Bicarbonate_max      2.568068
## Bicarbonate_range    2.303757
## Chloride_max         1.750666
## onset_site_mean      1.663481
## trunk_max            2.706410
## Gender_mean          1.919380
## Creatinine_min       1.535642


# plot predStepwise
# plot(predStepwise)

# Boruta vs. Stepwise feature selection
intersect(predBoruta, stepwiseConfirmedVars)

## [1] "ALSFRS_Total_median" "ALSFRS_Total_min"    "ALSFRS_Total_range"
## [4] "Creatinine_min"      "onset_delta_mean"    "respiratory_min"
## [7] "respiratory_range"   "trunk_max"           "trunk_range"

There are nine common variables chosen by the Boruta and stepwise
feature selection methods.

There is another, more elaborate stepwise feature selection technique,
implemented in the function MASS::stepAIC(), that is useful for a wider range
of object classes.
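As a minimal sketch of how MASS::stepAIC() can be called, the example below runs a bidirectional AIC search that mirrors the step() call used earlier. It assumes the built-in mtcars data as a stand-in for the ALS training set, and the object names base.fit, full.fit, and aic.fit are hypothetical.

```r
library(MASS)   # provides stepAIC()

# Assumption: the built-in mtcars data stands in for the ALS training set
base.fit <- lm(mpg ~ 1, data = mtcars)     # intercept-only model
full.fit <- lm(mpg ~ ., data = mtcars)     # model with all predictors

# Bidirectional search between the two models, penalized by AIC (k = 2)
aic.fit <- stepAIC(base.fit,
                   scope = list(lower = ~ 1, upper = formula(full.fit)),
                   direction = "both", k = 2, trace = FALSE)

formula(aic.fit)   # the selected model
aic.fit$anova      # the sequence of steps taken during the search
```

The anova component records which terms were added or dropped at each step, which makes it easy to compare the search path with the one produced by step().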


17.3  Practice  Problem 

You can practice variable selection using the SOCR_Data_AD_BiomedBigMetadata dataset
on the SOCR website. This is a smaller dataset that has 744 observations and
63 variables. Here we utilize DXCURREN, the current diagnosis, as the class variable.
Let's import the dataset first.

library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_AD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...

alzh <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(alzh)

##       SID            MMSCORE        FAQTOTAL            GDTOTAL
##  Min.   :   2.0   Min.   :18.00   Length:744         Min.   :0.000
##  1st Qu.: 355.5   1st Qu.:25.00   Class :character   1st Qu.:0.000
##  Median : 697.5   Median :27.00   Mode  :character   Median :1.000
##  Mean   : 707.5   Mean   :26.81                      Mean   :1.367
##  3rd Qu.:1063.0   3rd Qu.:29.00                      3rd Qu.:2.000
##  Max.   :1435.0   Max.   :30.00                      Max.   :6.000
## ...
##      CDHOME           CDCARE          CDGLOBAL
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
##  Median :0.0000   Median :0.0000   Median :0.0000
##  Mean   :0.2513   Mean   :0.2849   Mean   :0.0672
##  3rd Qu.:0.5000   3rd Qu.:0.5000   3rd Qu.:0.0000
##  Max.   :2.0000   Max.   :2.0000   Max.   :2.0000


The data summary shows that several variables are stored as character strings. After
converting their type to numeric, we find some missing data. We can manage this issue by
selecting only the complete observations of the original dataset or by using
multivariate imputation, see Chap. 3.


chrtofactor <- c(3, 5, 0, 10, 21:22, 51:54)
alzh[chrtofactor] <- data.frame(apply(alzh[chrtofactor], 2, as.numeric))
alzh <- alzh[complete.cases(alzh), ]

For  simplicity,  here  we  eliminated  the  missing  data  and  are  left  with  408  complete 
observations.  Now,  we  can  apply  the  Boruta  method  for  feature  selection. 


## Boruta performed 99 iterations in 9.413648 secs.
## 12 attributes confirmed important: adascog, BCBREATH, CDCARE,
## CDCOMMUN, CDGLOBAL and 7 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;
## 2 tentative attributes left: ApoEGeneAllele1, ApoEGeneAllele2;


You might get a slightly different result. We can plot the variable
importance graph using the same approach as before (Fig. 17.3).

The  final  step  is  to  get  rid  of  the  tentative  features. 

## Boruta performed 99 iterations in 9.413648 secs.
## Tentatives roughfixed over the last 99 iterations.
## 14 attributes confirmed important: adascog, ApoEGeneAllele1,
## ApoEGeneAllele2, BCBREATH, CDCARE and 9 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;

##  [1] "MMSCORE"         "FAQTOTAL"        "adascog"
##  [4] "sobcdr"          "DX_Confidence"   "BCBREATH"
##  [7] "ApoEGeneAllele1" "ApoEGeneAllele2" "CDORIENT"
## [10] "CDJUDGE"         "CDCOMMUN"        "CDHOME"
## [13] "CDCARE"          "CDGLOBAL"

Can  you  reproduce  these  results?  Also  try  to  apply  some  of  these  techniques  to 
other  data  from  the  list  of  our  Case-Studies. 
Fig. 17.3 Variable importance plot of predicting diagnosis for the Alzheimer's disease case-study


17.4 Assignment: 17. Variable/Feature Selection
17.4.1 Wrapper Feature Selection

• Explain the three major types of feature selection methods:

• Filter,

• Wrapper, and

• Embedded.


17.4.2 Use the PPMI Dataset

Use the 06_PPMI_ClassificationValidationData_Short dataset, setting
ResearchGroup as the class variable.

• Delete irrelevant columns (e.g., X, FID_IID) and select only the PD and Control
cases.

• Properly convert the variable types.

• Apply Boruta to train a model and try different parameters (e.g., different
pValue, maxRuns). What are the differences?

• Summarize and visualize the results.

• Apply Recursive Feature Elimination (RFE) and tune the model size.

• Evaluate the Boruta model performance by comparing with RFE.

• Output and compare the variables selected by both methods. How much overlap
is there in the selected variables?
Chapter 18

Regularized  Linear  Modeling 
and  Controlled  Variable  Selection 
Many biomedical and biosocial studies involve large amounts of complex data,
including cases where the number of features ($k$) is large and may exceed the number
of cases ($n$). In such situations, parameter estimates are difficult to compute or may
be unreliable as the system is underdetermined. Regularization provides one
approach to improve model reliability, prediction accuracy, and result interpretability.
It is based on augmenting the primary fidelity term of the objective function used
in the model-fitting process with a dual regularization term that provides restrictions
on the parameter space.

Classical  techniques  for  choosing  important  covariates  to  include  in  a  model  of 
complex  multivariate  data  rely  on  various  types  of  stepwise  variable  selection 
processes,  see  Chap.  17.  These  tend  to  improve  prediction  accuracy  in  certain 
situations, e.g., when a small number of features are strongly predictive of, or
associated with, the clinical outcome or biosocial trait. However, the prediction error may
be  large  when  the  model  relies  purely  on  a  fidelity  term.  Including  a  regularization 
term  in  the  optimization  of  the  cost  function  improves  the  prediction  accuracy.  For 
example,  below  we  show  that  by  shrinking  large  regression  coefficients,  ridge 
regularization  reduces  overfitting  and  decreases  the  prediction  error.  Similarly,  the 
Least  Absolute  Shrinkage  and  Selection  Operator  (LASSO)  employs  regularization 
to  perform  simultaneous  parameter  estimation  and  variable  selection.  LASSO 
enhances  the  prediction  accuracy  and  provides  a  natural  interpretation  of  the 
resulting  model.  Regularization  refers  to  forcing  certain  characteristics  of 
model-based  scientific  inference,  e.g.,  discouraging  complex  models  or  extreme 
explanations,  even  if  they  fit  the  data  well,  by  enforcing  model  generalizability  to 
prospective  data,  or  restricting  model  overfitting  of  accidental  samples. 

In  this  chapter,  we  extend  the  mathematical  foundation  we  presented  in  Chap.  5 
and  (1)  discuss  computational  protocols  for  handling  complex  high-dimensional 
data,  (2)  illustrate  model  estimation  by  controlling  the  false-positive  rate  of  selection 
of  salient  features,  and  (3)  derive  effective  forecasting  models. 
18.1 Questions

•  How  to  deal  with  extremely  high-dimensional  data  (hundreds  or  thousands  of 
features)? 

•  Why  mix  fidelity  (model  fit)  and  regularization  (model  interpretability)  terms  in 
objective  function  optimization? 

•  How  to  reduce  the  false-positive  rate,  increase  scientific  validation,  and  improve 
result  reproducibility  (e.g.,  Knockoff  filtering)? 


18.2  Matrix  Notation 

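We should review the basics of matrix notation, linear algebra, and matrix computing
we covered in Chap. 5. At the core of matrix manipulations are scalars, vectors,
and matrices.

• $y_i$: output or response variable, $i = 1, \ldots, n$ (cases/subjects).

• $x_{i,j}$: input, predictor, or feature variable, $1 \le j \le k$, $1 \le i \le n$.

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
\quad \text{and} \quad
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,k}
\end{pmatrix}.
$$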

18.3  Regularized  Linear  Modeling 

If we assume that the covariates are orthonormal, i.e., we have a special kind of
design matrix satisfying $X^T X = I$, then (with $N$ denoting the number of cases):

• The ordinary least squares (OLS) estimates minimize

$$\min_{\beta \in \mathbb{R}^k} \left\{ \frac{1}{N} \| y - X\beta \|_2^2 \right\},$$

and are defined by

$$\hat{\beta}^{OLS} = (X^T X)^{-1} X^T y = X^T y.$$

• LASSO estimates minimize

$$\min_{\beta \in \mathbb{R}^k} \left\{ \frac{1}{N} \| y - X\beta \|_2^2 + \lambda \| \beta \|_1 \right\},$$

and are defined as a soft-threshold function of the OLS estimates:

$$\hat{\beta}_j = S_{N\lambda/2}\left(\hat{\beta}_j^{OLS}\right) = \hat{\beta}_j^{OLS} \max\left(0, 1 - \frac{N\lambda}{2\,|\hat{\beta}_j^{OLS}|}\right),$$

where $S_{N\lambda/2}$ is a soft thresholding operator translating values towards zero, instead of
setting smaller values to zero and leaving larger ones untouched as the hard
thresholding operator would.

• Ridge regression estimates minimize the following objective function:

$$\min_{\beta \in \mathbb{R}^k} \left\{ \frac{1}{N} \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right\},$$

which yields estimates $\hat{\beta}_j = (1 + N\lambda)^{-1} \hat{\beta}_j^{OLS}$. Thus, ridge regression shrinks all
coefficients by a uniform factor, $(1 + N\lambda)^{-1}$, and does not set any coefficients to zero.

• Best subset selection regression, also called orthogonal matching pursuit
(OMP), minimizes:

$$\min_{\beta \in \mathbb{R}^k} \left\{ \frac{1}{N} \| y - X\beta \|_2^2 + \lambda \| \beta \|_0 \right\},$$

where $\| \cdot \|_0$ is the "$\ell^0$ norm", defined as $\| z \|_0 = m$ if exactly $m$ components of $z$ are
nonzero. In this case, the estimates are:

$$\hat{\beta}_j = H_{\sqrt{N\lambda}}\left(\hat{\beta}_j^{OLS}\right) = \hat{\beta}_j^{OLS} \, I\left(|\hat{\beta}_j^{OLS}| > \sqrt{N\lambda}\right),$$

where $H_{\alpha}$ is a hard-thresholding function and $I$ is an indicator function (it is
1 if its argument is true, and 0 otherwise).

The LASSO estimates share features of the estimates from both ridge and
best subset selection regression. Like ridge regression, LASSO shrinks the magnitude
of all the coefficients. However, LASSO also sets some of them to zero,
as best subset selection does. Ridge regression scales all of the coefficients
by a constant factor, whereas LASSO translates the coefficients towards zero by a
constant value and sets some of them to zero.
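The three shrinkage rules are easy to compare numerically. The sketch below applies them to a made-up vector of OLS estimates under the orthonormal-design assumption. The threshold thr and the ridge constant c.ridge are hypothetical values chosen for illustration, not quantities derived from a dataset.

```r
# Hypothetical OLS estimates under an orthonormal design
beta.ols <- c(-3, -0.5, 0.2, 1, 4)
thr      <- 1     # threshold used by the soft and hard rules (assumed value)
c.ridge  <- 1     # ridge constant, so the shrink factor is 1 / (1 + c.ridge)

soft  <- sign(beta.ols) * pmax(abs(beta.ols) - thr, 0)  # LASSO: translate toward 0
hard  <- beta.ols * (abs(beta.ols) > thr)               # best subset: keep or drop
ridge <- beta.ols / (1 + c.ridge)                       # ridge: uniform rescaling

round(cbind(beta.ols, soft, hard, ridge), 3)
```

Soft thresholding moves the large estimates (-3 and 4) toward zero by exactly thr and zeroes the rest. Hard thresholding leaves -3 and 4 untouched and zeroes the rest. Ridge halves every estimate and zeroes none.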
18.3.1 Ridge Regression

Ridge regression relies on $L_2$ regularization to improve the model prediction
accuracy. It reduces the prediction error by shrinking large regression coefficients,
which reduces overfitting. By itself, ridge regularization does not perform variable
selection and does not really help with model interpretation.
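The shrinkage behavior can be seen directly from the closed-form ridge solution. The following base-R sketch uses simulated data and a hypothetical helper, ridge_coef(), that minimizes the unscaled objective $\|y - Xb\|_2^2 + \lambda \|b\|_2^2$. This parameterization is an assumption made for simplicity, and its $\lambda$ is not on the same scale as the one used by glmnet.

```r
# Simulated data (an assumption for illustration): 3 standardized predictors
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                          # centered response, so no intercept

# Hypothetical helper: closed-form minimizer of ||y - X b||^2 + lambda * ||b||^2
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0, 1, 10, 100, 1000)
path <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(path) <- paste0("lambda=", lambdas)
round(path, 3)

# The coefficient norm decreases as lambda grows, but no entry is exactly zero
norms <- sqrt(colSums(path^2))
norms
```

With lambda = 0 the result equals the OLS fit. Larger penalties pull all three coefficients toward zero together, which is why ridge alone does not select variables.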

Let's look at one example using one of our datasets, 01a_data.txt (Figs. 18.1, 18.2
and 18.3).


Fig. 18.1 Plot of the MSE rate of the ridge-regularized linear model of MLB player's weight
against the regularization weight parameter $\lambda$ (log scale on the x-axis)


Fig. 18.2 Plot of the effect-size coefficients (Age and Height) of the ridge-regularized linear model
of MLB player's weight against the regularization weight parameter $\lambda$

Fig. 18.3 Effect of the regularization weight parameter $\lambda$ on the model coefficients (Age and
Height) of the ridge-regularized linear model of MLB player's weight


# Data: https://umich.instructure.com/courses/38100/files/folder/data (01a_data.txt)
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',
                   as.is = T, header = T)
attach(data); str(data)

## 'data.frame': 1034 obs. of 6 variables:
##  $ Name    : chr "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
##  $ Team    : chr "BAL" "BAL" "BAL" "BAL" ...
##  $ Position: chr "Catcher" "Catcher" "Catcher" "First_Baseman" ...
##  $ Height  : int 74 74 72 72 73 69 69 71 76 71 ...
##  $ Weight  : int 180 215 210 210 188 176 209 200 231 180 ...
##  $ Age     : num 23 34.7 30.8 35.4 35.7 ...

# Training Data
# Full Model: x <- model.matrix(Weight ~ ., data = data[1:900, ])
# Reduced Model
x <- model.matrix(Weight ~ Age + Height, data = data[1:900, ])
# creates a design (or model) matrix, and adds an intercept column
# according to the formula.
y <- data[1:900, ]$Weight

# Testing Data
x.test <- model.matrix(Weight ~ Age + Height, data = data[901:1034, ])
y.test <- data[901:1034, ]$Weight

# install.packages("glmnet")
library("glmnet")

cv.ridge <- cv.glmnet(x, y, type.measure = "mse", alpha = 0)
## alpha = 1 for lasso only, alpha = 0 for ridge only, and 0 < alpha < 1 to
## blend the ridge & lasso penalties
plot(cv.ridge)
coef(cv.ridge)

## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -55.7491733
## (Intercept)   .
## Age           0.6264096
## Height        3.2485564

sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])

## [1] 17.94358

# plot variable feature coefficients against the shrinkage parameter lambda.
glmmod <- glmnet(x, y, alpha = 0)
plot(glmmod, xvar = "lambda")
grid()

# for plot_glmnet with ridge/lasso coefficient path labels
# install.packages("plotmo")
library(plotmo)
plot_glmnet(glmmod, lwd = 4)  # default colors

# More elaborate plots can be generated using:
# plot_glmnet(glmmod, label = 2, lwd = 4)  # label the 2 biggest final coefs
# specify color of each line
# g <- "blue"
# plot_glmnet(glmmod, lwd = 4, col = c(2, g))

# report the model coefficient estimates
coef(glmmod)[, 1]

##  (Intercept)  (Intercept)          Age       Height
## 2.016556e+02 0.000000e+00 8.327372e-37 4.789383e-36

cv.glmmod <- cv.glmnet(x, y, alpha = 0)

mod.ridge <- cv.glmnet(x, y, alpha = 0, thresh = 1e-12)
lambda.best <- mod.ridge$lambda.min
lambda.best

## [1] 1.192177

ridge.pred <- predict(mod.ridge, newx = x.test, s = lambda.best)
ridge.RMS <- mean((y.test - ridge.pred)^2); ridge.RMS

## [1] 264.083

ridge.test.r2 <- 1 - mean((y.test - ridge.pred)^2)/mean((y.test - mean(y.test))^2)

# plot(cv.glmmod)
best_lambda <- cv.glmmod$lambda.min
best_lambda

## [1] 1.192177

In the plots above, different colors represent different features, and the
corresponding coefficients are displayed as a function of the regularization parameter,
λ. The top axis indicates the number of nonzero coefficients at the current value
of λ. For LASSO regularization, this top axis corresponds to the effective degrees of
freedom (df) for the model.

Notice the usefulness of Ridge regularization for model estimation in highly
ill-conditioned problems (n ≪ k) where slight feature perturbations may cause
disproportionate alterations of the corresponding weight calculations. When λ is very
large, the regularization effect dominates the optimization of the objective function
and the coefficients tend to zero. At the other extreme, as λ → 0, the resulting model
solution tends towards the ordinary least squares (OLS) and the coefficients exhibit
large oscillations. In practice, we often need to tune λ to balance this tradeoff.

Also note that in the cv.glmnet call, alpha = 0 (ridge) and alpha = 1
(LASSO) correspond to different types of regularization, and 0 < alpha < 1
corresponds to elastic net blended regularization.


18.3.2  Least  Absolute  Shrinkage  and  Selection  Operator 
(LASSO)  Regression 

Estimating the coefficients of a linear regression model using LASSO involves
minimizing an objective function that includes an L1 regularization term, which
tends to shrink the number of features. A descriptive representation of the
fidelity (left) and regularization (right) terms of the objective function is shown
below:

$$\underbrace{\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{k} x_{ij}\beta_j\right)^2}_{\text{fidelity term}} + \underbrace{\lambda}_{\text{reg. weight}}\ \underbrace{\sum_{j=1}^{k}|\beta_j|}_{\text{regularization term}}$$



LASSO  jointly  achieves  model  quality,  reliability  and  variable  selection  by 
penalizing  the  sum  of  the  absolute  values  of  the  regression  coefficients.  This  forces 
the  shrinkage  of  certain  coefficients  effectively  acting  as  a  variable  selection  process. 
This  is  similar  to  ridge  regression’s  penalty  on  the  sum  of  the  squares  of  the 
regression  coefficients,  although  ridge  regression  only  shrinks  the  magnitude  of 
the  coefficients  without  truncating  them  to  0. 
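
To make the contrast between the two penalties concrete, the following minimal R sketch assumes an idealized orthonormal design (X'X = I), a simplification that is not part of the MLB example. Under that assumption, and for a squared-error fidelity term plus λ times the penalty, the LASSO solution soft-thresholds each OLS coefficient by λ/2, whereas the ridge solution rescales it by 1/(1 + λ). The helper names and coefficient values are hypothetical.

```r
# Closed-form shrinkage under an (assumed) orthonormal design, X'X = I
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)  # LASSO
ridge_shrink   <- function(b, lambda) b / (1 + lambda)                        # ridge

b_ols  <- c(3, -1.5, 0.4, -0.2)   # hypothetical OLS estimates
lambda <- 1

b_lasso <- soft_threshold(b_ols, lambda)  # small coefficients become exactly 0
b_ridge <- ridge_shrink(b_ols, lambda)    # all coefficients shrink, none vanish

print(rbind(OLS = b_ols, LASSO = b_lasso, Ridge = b_ridge))
```

In this toy setting the two smallest coefficients are truncated to zero by the L1 penalty, while the L2 penalty only reduces their magnitude.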

Let’s show how to select the regularization weight parameter λ using training
data and report the error (e.g., MSE) using testing data.
mod.lasso <- cv.glmnet(x, y, alpha = 1, thresh = 1e-12)
## alpha = 1 for lasso only, alpha = 0 for ridge only, and 0 < alpha < 1 for
## elastic net, a blend of ridge & lasso penalty !!!!

lambda.best <- mod.lasso$lambda.min
lambda.best

## [1] 0.05933494

lasso.pred <- predict(mod.lasso, newx = x.test, s = lambda.best)
LASSO.MSE <- mean((y.test - lasso.pred)^2); LASSO.MSE

## [1] 261.8194

Let’s  retrieve  the  estimates  of  the  model  coefficients. 

mod.lasso <- glmnet(x, y, alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")

## 4 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -181.9254079
## (Intercept)    .
## Age            0.9654354
## Height         4.8284803

lasso.test.r2 <- 1 - mean((y.test - lasso.pred)^2)/mean((y.test - mean(y.test))^2)

For comparison, we can also fit a classical OLS linear model.

lm.fit <- lm(Weight ~ Age + Height, data = data[1:900, ])
summary(lm.fit)

##
## Call:
## lm(formula = Weight ~ Age + Height, data = data[1:900, ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -50.602 -12.399  -0.718  10.913  74.446
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -184.3736    19.4232  -9.492  < 2e-16 ***
## Age            0.9799     0.1335   7.341 4.74e-13 ***
## Height         4.8561     0.2551  19.037  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 897 degrees of freedom
## Multiple R-squared:  0.3088, Adjusted R-squared:  0.3072
## F-statistic: 200.3 on 2 and 897 DF,  p-value: < 2.2e-16

The OLS linear (unregularized) model has slightly larger coefficients and greater
MSE than LASSO, which attests to the shrinkage of LASSO (Fig. 18.4).
Fig. 18.4 Comparison of the coefficients of determination (R²) for three alternative models
(bar plot titled "Testing Data Derived R-squared"; bars: OLS, LASSO, Ridge)


Table 18.1 Results of the test dataset errors (MSE) for the three methods

LM          LASSO       Ridge
305.1995    261.8194    264.083


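# Note: predict() for lm objects expects `newdata` (a data frame), not `newx`.
# As written, the call below returns the fitted values for the 900 training
# cases, so LM.MSE and lm.test.r2 are based on training-set residuals. A
# test-set version would use:
#   predict(lm.fit, newdata = data[901:1034, ]) and (y.test - lm.pred)
lm.pred <- predict(lm.fit, newx = x.test)
LM.MSE <- mean((y - lm.pred)^2); LM.MSE

## [1] 305.1995

lm.test.r2 <- 1 - mean((y - lm.pred)^2) / mean((y.test - mean(y.test))^2)

barplot(c(lm.test.r2, lasso.test.r2, ridge.test.r2), col = "red",
        names.arg = c("OLS", "LASSO", "Ridge"), main = "Testing Data Derived R-squared")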

Compare the results of the three alternative models (LM, LASSO and Ridge) for
these data and contrast the derived MSE results (Table 18.1).

library(knitr)  # kable function to convert tabular R-results into Rmd tables

# create table as data frame
MSE_Table = data.frame(LM = LM.MSE, LASSO = LASSO.MSE, Ridge = ridge.RMS)

# convert to markdown
kable(MSE_Table, format = "pandoc", caption = "Results of test dataset errors",
      align = c("c", "c", "c"))

As both the inputs (features or predictors) and the output (response) are observed
for the testing data, we can learn the relationship between the two types of features
(controlled covariates and observable responses). Most often, we are interested in
forecasting or predicting responses using prospective (new, testing, or
validation) data.
18.3.3 Predictor Standardization

Prior to fitting regularized linear models and estimating the effects, covariates may
be standardized. This can be accomplished by using the classic “z-score” formula.
This puts each predictor on the same scale (unitless quantities) - the mean is 0 and the
variance is 1. We use $\hat{\beta}_0 = \bar{y}$ for the mean intercept parameter, and estimate the
coefficients of the remaining predictors. To facilitate interpretation of the model or
results in the context of the specific case-study, we can transform the results back to
the original scale/units after the model is estimated.
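
The standardize-then-back-transform steps can be sketched in base R as follows. This is only an illustration on simulated data (the variable names and generating coefficients are hypothetical), and it uses an unpenalized fit so that the back-transformed estimates can be checked against a fit on the original scale.

```r
set.seed(1)
n <- 200
X <- cbind(age = rnorm(n, 30, 5), height = rnorm(n, 73, 2))   # simulated predictors
y <- -180 + 1.0 * X[, "age"] + 4.8 * X[, "height"] + rnorm(n, sd = 15)

# z-score each predictor: mean 0, variance 1
Z  <- scale(X)
mu <- attr(Z, "scaled:center")   # column means used for centering
s  <- attr(Z, "scaled:scale")    # column standard deviations used for scaling

# With centered predictors the intercept estimate is mean(y), so fit the
# slopes on the centered response without an intercept
fit_std  <- lm(I(y - mean(y)) ~ Z - 1)
beta_std <- coef(fit_std)

# Transform the estimates back to the original units
beta_orig  <- beta_std / s
beta0_orig <- mean(y) - sum(beta_orig * mu)

# Reference fit on the original scale
fit_raw <- lm(y ~ X)
print(rbind(back_transformed = c(beta0_orig, beta_orig), original_scale = coef(fit_raw)))
```

For penalized fits the same back-transformation applies, but the estimates will generally differ from those obtained without standardization, because the penalty treats all coefficients on a common scale.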


18.3.4  Estimation  Goals 


The basic setting here is: given a set of predictors X, find a function, f(X), to model or
predict the outcome Y.

Let’s denote the objective (loss or cost) function by L(y, f(X)). It determines the
adequacy of the fit and allows us to estimate the squared error loss:

$$L(y, f(X)) = (y - f(X))^2.$$

We are looking to find f that minimizes the expected loss:

$$E\left[(Y - f(X))^2\right] \Rightarrow f = E[Y | X = x].$$
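
A brief justification of why the conditional mean minimizes the expected squared error loss, sketched under the assumption that the second moments exist: for any candidate prediction c at a fixed X = x, the cross term vanishes and the conditional risk splits into two non-negative parts.

```latex
\begin{aligned}
E\left[(Y-c)^2 \mid X=x\right]
 &= E\left[(Y-E[Y\mid X=x])^2 \mid X=x\right] + \left(E[Y\mid X=x]-c\right)^2 \\
 &= \operatorname{Var}(Y\mid X=x) + \left(E[Y\mid X=x]-c\right)^2 .
\end{aligned}
```

The first term does not depend on c, so the risk is smallest when the second term is zero, i.e., when c = E[Y | X = x].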


18.4  Linear  Regression 

Let’s assume that:

$$Y_i = \beta_0 + x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,p}\beta_p + \epsilon.$$

• In shorthand matrix notation, that is: $Y = X\beta + \epsilon$.

• And the expectation of the observed outcome given the data, $E[Y | X = x]$, is a
linear function, which in certain situations can be expressed as:

$$\arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2.$$

Remember that matrix multiplication is not always commutative. Multiplying both
sides on the left by $X' = X^T$, the transpose of the design matrix X, yields:

$$X^T Y = X^T (X\beta) = (X^T X)\beta.$$
To solve for the effect-sizes β, we can multiply both sides of the equation by the
inverse of its (right hand side) multiplier:

$$(X^T X)^{-1}(X^T Y) = (X^T X)^{-1}(X^T X)\beta = \beta.$$

The ordinary least squares (OLS) estimate of β is given by:

$$\hat{\beta}^{OLS} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 = \arg\min_{\beta} \| y - X\beta \|_2^2 = (X'X)^{-1}X'y \ \Rightarrow\ \hat{f}(x_i) = x_i'\hat{\beta}.$$
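
As a quick numerical check of the closed-form expression, the following base-R sketch solves the normal equations on simulated data and compares the result with lm(). The data-generating values are hypothetical, and solve() is applied to the linear system rather than forming the inverse explicitly, which is a common numerical preference rather than something the derivation requires.

```r
set.seed(42)
n <- 100
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))      # design matrix with intercept column
beta_true <- c(2, -1, 0.5)                       # hypothetical effect sizes
y <- drop(X %*% beta_true + rnorm(n, sd = 0.3))

# Normal equations: (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Reference: lm() on the same design (intercept already included in X)
fit <- lm(y ~ X - 1)
print(cbind(normal_equations = beta_hat, lm = coef(fit)))

# Fitted values: f_hat(x_i) = x_i' beta_hat
y_hat <- drop(X %*% beta_hat)
```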


18.4.1  Drawbacks  of  Linear  Regression 

Despite  its  wide  use  and  elegant  theory,  linear  regression  has  some  shortcomings. 

•  Prediction accuracy - can often be improved upon;

•  Model interpretability - linear models do not automatically perform variable
selection.

18.4.2  Assessing  Prediction  Accuracy 

Given a new input, $x_0$, how do we assess our prediction $\hat{f}(x_0)$?
The Expected Prediction Error (EPE) is:

$$EPE(x_0) = E\left[(Y_0 - \hat{f}(x_0))^2\right] = Var(\epsilon) + Var(\hat{f}(x_0)) + Bias(\hat{f}(x_0))^2 = Var(\epsilon) + MSE(\hat{f}(x_0)),$$

where

•  $Var(\epsilon)$: irreducible error variance,

•  $Var(\hat{f}(x_0))$: sample-to-sample variability of $\hat{f}(x_0)$, and

•  $Bias(\hat{f}(x_0))$: average difference of $\hat{f}(x_0)$ & $f(x_0)$.
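
The decomposition can be illustrated with a small Monte Carlo simulation. The sketch below is built entirely on assumptions made for illustration: a hypothetical true function f(x) = 1 + 2x, Gaussian noise, and a deliberately biased estimator that shrinks the fitted slope by a fixed factor. It repeatedly draws training samples, records the prediction at x0, and then estimates the variance, squared bias, and MSE terms.

```r
set.seed(2024)
f      <- function(x) 1 + 2 * x   # hypothetical true regression function
sigma  <- 1                       # noise standard deviation
x0     <- 0.5                     # new input
n      <- 30                      # training sample size
n_sim  <- 2000                    # number of simulated training sets
shrink <- 0.8                     # fixed slope shrinkage (introduces bias)

fhat_x0 <- replicate(n_sim, {
  x <- runif(n, -1, 1)
  y <- f(x) + rnorm(n, sd = sigma)
  b <- unname(coef(lm(y ~ x)))
  b[1] + shrink * b[2] * x0       # prediction at x0 from this training set
})

bias2    <- (mean(fhat_x0) - f(x0))^2
variance <- mean((fhat_x0 - mean(fhat_x0))^2)
mse      <- mean((fhat_x0 - f(x0))^2)
epe      <- sigma^2 + mse         # Var(eps) + MSE(f_hat(x0))

print(c(variance = variance, bias2 = bias2, mse = mse, epe = epe))
```

Across the simulated training sets, the MSE term equals the sum of the variance and squared-bias terms, and adding the irreducible noise variance gives the estimate of EPE(x0).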

18.4.3  Estimating  the  Prediction  Error 

Common  approaches  to  estimating  prediction  errors  include: 
•  Randomly splitting the data into “training” and “testing” sets, where the testing
data has m observations that will be used to independently validate the model
quality. We estimate/calculate $\hat{f}$ using training data;

•  Estimating prediction error using the testing set MSE:

$$\widehat{MSE}(\hat{f}) = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{f}(x_i)\right)^2.$$

Ideally, we want our model/predictions to perform well with new or
prospective data.
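
A minimal base-R sketch of this train/test procedure is shown below. It uses simulated data with hypothetical variable names and an 80/20 split; the model is estimated on the training cases only, and the error is computed on the m held-out cases.

```r
set.seed(7)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))          # simulated predictors
dat$y <- 3 + 1.5 * dat$x1 - 2 * dat$x2 + rnorm(n)        # simulated outcome

# Random 80/20 split into training and testing sets
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Estimate f_hat using the training data only
fit <- lm(y ~ x1 + x2, data = train)

# Testing-set MSE over the m held-out observations
pred     <- predict(fit, newdata = test)
m        <- nrow(test)
test_mse <- sum((test$y - pred)^2) / m
print(test_mse)
```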


18.4.4  Improving  the  Prediction  Accuracy 

If f(x) ≈ linear, $\hat{f}$ will have low bias but possibly high variance, e.g., in
high-dimensional settings due to correlated predictors, overdetermination (k features ≪ n
cases), or underdetermination (k > n). The goal is to minimize total error by trading
off bias and precision:

$$MSE(\hat{f}(x)) = Var(\hat{f}(x)) + Bias(\hat{f}(x))^2.$$

We can sacrifice bias to reduce variance, which may lead to a decrease in MSE.
So, regularization allows us to tune this tradeoff.

We aim to predict the outcome variable, $Y_{n \times 1}$, in terms of other features $X_{n \times k}$.
Assume a first-order relationship relating Y and X is of the form Y = f(X) + ε, where
the error term ε ~ N(0, σ). An estimated model $\hat{f}(X)$ can be computed in many
different ways (e.g., using least squares calculations for linear regressions,
Newton-Raphson, steepest descent, and other methods). Then, we can decompose the expected
squared prediction error at x as:

$$E(x) = E\left[(Y - \hat{f}(x))^2\right] = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{Bias^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{precision (variance)}} + \underbrace{\sigma^2}_{\text{irreducible error (noise)}}.$$


When the true Y vs. X relation is not known, infinite data may be necessary to
calibrate the model $\hat{f}$ and it may be impractical to jointly reduce both the model
bias and variance. In general, minimizing the bias at the same time as minimizing
the variance may not be possible.

Figure  18.5  illustrates  diagrammatically  the  dichotomy  between  bias  (accuracy) 
and  precision  (variability).  Additional  information  is  available  in  the  SOCR  SMHS 
EBook. 
Fig. 18.5 Graphical representation of the four extreme scenarios for bias and precision
(panels: Precise & Unbiased, Precise & Biased, Imprecise & Unbiased, Imprecise & Biased)


18.4.5  Variable  Selection 

Oftentimes,  we  are  only  interested  in  using  a  subset  of  the  original  features  as  model 
predictors.  Thus,  we  need  to  identify  the  most  relevant  predictors,  which  usually 
capture  the  big  picture  of  the  process.  This  helps  us  avoid  overly  complex  models 
that  may  be  difficult  to  interpret.  Typically,  when  considering  several  models  that 
achieve  similar  results,  it’s  natural  to  select  the  simplest  of  them. 

Linear  regression  does  not  directly  determine  the  importance  of  features  to  predict 
a  specific  outcome.  The  problem  of  selecting  critical  predictors  is  therefore  very 
important. 

Automatic feature subset selection methods should directly determine an optimal
subset of variables. Forward or backward stepwise variable selection and forward
stagewise are examples of classical methods for choosing the best subset by
assessing various metrics like MSE, $C_p$, AIC, or BIC; see Chap. 17.
18.5 Regularization Framework

As before, we start with a given X and look for a (linear) function, f(X), to model or
predict y subject to a certain objective cost function, e.g., squared error loss. Adding a
second term to the cost function minimization process yields (model parameter)
estimates expressed as:

$$\hat{\beta}(\lambda) = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda J(\beta)\right\}.$$

In the above expression, λ ≥ 0 is the regularization (tuning or penalty) parameter, and
J(β) is a user-defined penalty function - typically, the intercept is not
penalized.


18.5.1  Role  of  the  Penalty  Term 


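Consider $J(\beta) = \sum_{j=1}^{p}\beta_j^2 = \|\beta\|_2^2$ (Ridge Regression, RR). Then, the formulation
of the regularization framework is:

$$\hat{\beta}(\lambda)^{RR} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}.$$

Or, alternatively:

$$\hat{\beta}(t)^{RR} = \arg\min_{\beta}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le t.$$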
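
For the ridge penalty, the penalized least squares problem has a closed-form minimizer, $(X^TX + \lambda I)^{-1}X^Ty$ (a standard result, stated here for standardized predictors and a centered response so that no intercept is needed). The following base-R sketch, on simulated data with hypothetical effect sizes, evaluates this expression over a small grid of λ values to show how the size of the coefficient vector shrinks as the penalty weight grows.

```r
# Ridge estimate for a given lambda: solves (X'X + lambda I) beta = X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

set.seed(11)
n <- 80
X <- scale(matrix(rnorm(n * 3), n, 3))        # standardized predictors
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))     # hypothetical effects plus noise
y <- y - mean(y)                              # centered response (no intercept)

lambdas  <- c(0, 1, 10, 100, 1000)
path     <- sapply(lambdas, function(l) ridge_coef(X, y, l))  # one column per lambda
l2_norms <- sqrt(colSums(path^2))

colnames(path) <- paste0("lambda=", lambdas)
print(round(path, 4))
print(round(l2_norms, 4))
```

At λ = 0 the estimate coincides with OLS; as λ increases the L2 norm of the estimate decreases, yet the individual coefficients remain nonzero.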


18.5.2  Role  of  the  Regularization  Parameter 

The regularization parameter λ ≥ 0 directly controls the bias-variance trade-off:

•  λ = 0 corresponds to OLS, and

•  λ → ∞ puts more weight on the penalty function and results in more shrinkage of
the coefficients, i.e., we introduce bias for the sake of reducing the variance.

The choice of λ is crucial and will be discussed below, as each λ results in a
different solution $\hat{\beta}(\lambda)$.


18.5.3  LASSO 

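The LASSO (Least Absolute Shrinkage and Selection Operator) regularization
relies on:

$$J(\beta) = \sum_{j=1}^{k}|\beta_j| = \|\beta\|_1,$$

which leads to the following objective function:

$$\hat{\beta}(\lambda)^{L} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{k} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{k}|\beta_j|\right\}.$$
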

In  practice,  subtle  changes  in  the  penalty  terms  frequently  lead  to  big  differences 
in  the  results.  Not  only  does  the  regularization  term  shrink  coefficients  towards  zero, 
but  it  sets  some  of  them  to  be  exactly  zero.  Thus,  it  performs  continuous  variable 
selection,  hence  the  name,  Least  Absolute  Shrinkage  and  Selection  Operator 
(LASSO). 

For  further  details,  see  “Tibshirani’s  LASSO  Page”. 
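
To illustrate how an L1 penalty can produce exact zeros, here is a minimal cyclic coordinate descent sketch in base R for a squared-error fidelity term plus λ times the sum of absolute coefficients. It is an illustration on simulated data, not the algorithm used internally by any particular package; the function names, iteration count, and λ values are assumptions. Note that glmnet scales its objective differently, so its λ values are not directly comparable to the ones used here.

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding operator

# Cyclic coordinate descent for: sum((y - X b)^2) + lambda * sum(abs(b))
lasso_cd <- function(X, y, lambda, n_iter = 500) {
  p      <- ncol(X)
  beta   <- rep(0, p)
  col_ss <- colSums(X^2)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # partial residual that excludes the contribution of feature j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # one-dimensional minimizer: soft-threshold the inner product by lambda/2
      beta[j] <- soft(sum(X[, j] * r_j), lambda / 2) / col_ss[j]
    }
  }
  beta
}

set.seed(3)
n <- 100
X <- scale(matrix(rnorm(n * 4), n, 4))          # standardized predictors
y <- drop(X %*% c(3, 0, -2, 0) + rnorm(n))      # two truly inactive features
y <- y - mean(y)

beta_ols   <- lasso_cd(X, y, lambda = 0)        # no penalty: reduces to OLS
beta_small <- lasso_cd(X, y, lambda = 20)       # moderate penalty
beta_large <- lasso_cd(X, y, lambda = 1e4)      # very large penalty

print(round(rbind(OLS = beta_ols, moderate = beta_small, large = beta_large), 4))
```

With a sufficiently large penalty every coefficient is set exactly to zero, whereas a moderate penalty shrinks the active coefficients and may drop the inactive ones.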


18.5.4  General  Regularization  Framework 

The  general  regularization  framework  involves  optimization  of  a  more  general 
objective  function: 


$$\min_{f \in \mathcal{H}} \sum_{i=1}^{n}\left\{L(y_i, f(x_i)) + \lambda J(f)\right\},$$

where $\mathcal{H}$ is a space of possible functions, L is the fidelity term, e.g., squared error,
absolute error, zero-one, negative log-likelihood (GLM), hinge loss (support vector
machines), and J is the regularizer, e.g., ridge regression, LASSO, adaptive LASSO,
group LASSO, fused LASSO, thresholded LASSO, generalized LASSO,
constrained LASSO, elastic-net, Dantzig selector, SCAD, MCP, smoothing
splines, etc.

This represents a very general and flexible framework that allows us to incorporate
prior knowledge (sparsity, structure, etc.) into the model estimation.
18.6 Implementation of Regularization

18.6.1 Example: Neuroimaging-Genetics Study
of Parkinson’s Disease Dataset

More  information  about  this  specific  study  and  the  included  derived  neuroimaging 
biomarkers  is  available  online.  A  link  to  the  data  and  a  brief  summary  of  the  features 
are  included  below: 

# 05_PPMI_top_UPDRS_Integrated_LongFormat1.csv.

# Data elements include: FID_IID, L_insular_cortex_ComputeArea,
L_insular_cortex_Volume, R_insular_cortex_ComputeArea, R_insular_cortex_Volume,
L_cingulate_gyrus_ComputeArea, L_cingulate_gyrus_Volume,
R_cingulate_gyrus_ComputeArea, R_cingulate_gyrus_Volume, L_caudate_ComputeArea,
L_caudate_Volume, R_caudate_ComputeArea, R_caudate_Volume,
L_putamen_ComputeArea, L_putamen_Volume, R_putamen_ComputeArea,
R_putamen_Volume, Sex, Weight, ResearchGroup, Age,
chr12_rs34637584_GT, chr17_rs11868035_GT, chr17_rs11012_GT,
chr17_rs393152_GT, chr17_rs12185268_GT, chr17_rs199533_GT, UPDRS_part_I,
UPDRS_part_II, UPDRS_part_III, time_visit.

Note  that  the  dataset  includes  missing  values  and  repeated  measures. 

The goal of this demonstration is to use OLS, ridge regression, and the
LASSO to find the best predictive model for the clinical outcomes - UPDRS
score (vector) and Research Group (factor variable), in terms of demographic,
genetics, and neuroimaging biomarkers.

We  can  utilize  the  glmnet  package  in  R  for  most  calculations. 

#### Initial Stuff ####

# clean up
rm(list=ls())

# load required packages
# install.packages("arm")
library(glmnet)
library(arm)
library(knitr)  # kable function to convert tabular R-results into Rmd tables

# pick a random seed, but set.seed(seed) only affects the next block of code!
seed = 1234

#### Organize Data ####

# load dataset
# Data: https://umich.instructure.com/courses/38100/files/folder/data
# (05_PPMI_top_UPDRS_Integrated_LongFormat1.csv)
data1 <- read.table('https://umich.instructure.com/files/330397/download?download_frd=1',
                    sep=",", header=T)

# we will deal with missing values using multiple imputation later. For now,
# let's just ignore incomplete cases
data1.completeRowIndexes <- complete.cases(data1)
table(data1.completeRowIndexes)
## data1.completeRowIndexes
## FALSE  TRUE
##   609  1155

prop.table(table(data1.completeRowIndexes))

## data1.completeRowIndexes
##     FALSE      TRUE
## 0.3452381 0.6547619

attach(data1)

# View(data1[data1.completeRowIndexes, ])

# define response and predictors
y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III
table(y)  # Show Clinically relevant classification

## y
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## 54 20 25 12  8  7 11 16 16  9 21 16 13 13 22 25 21 31 25 29 29 28 20 25 28
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 26 35 41 23 34 32 31 37 34 28 36 29 27 22 19 17 18 18 19 16  9 10 12  9 11
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 66 68 69 71 75 80 81 82
##  7 10 11  5  7  4  1  5  9  4  3  2  1  6  1  2  1  2  1  1  2  3  1

y <- y[data1.completeRowIndexes]

# X = scale(data1[, ])  # Explicit scaling is not needed, as glmnet
# auto-standardizes predictors
# X = as.matrix(data1[, c("R_caudate_Volume", "R_putamen_Volume", "Weight",
#     "Age", "chr17_rs12185268_GT")])  # X needs to be a matrix, not a data frame

# NOTE: "PDRS_part_I" (as printed) does not match the column name UPDRS_part_I,
# so that column stays in X (it appears in the summary below); use
# "UPDRS_part_I" to exclude it from the predictors.
drop_features <- c("FID_IID", "ResearchGroup", "PDRS_part_I",
                   "UPDRS_part_II", "UPDRS_part_III")
X <- data1[, !(names(data1) %in% drop_features)]
X = as.matrix(X)  # remove columns: index, ResearchGroup, and
                  # y=(PDRS_part_I + UPDRS_part_II + UPDRS_part_III)
X <- X[data1.completeRowIndexes, ]
summary(X)


##  L_insular_cortex_ComputeArea L_insular_cortex_Volume
##  Min.   :  50.03              Min.   :   22.63
##  1st Qu.:2174.57              1st Qu.: 5867.23
##  Median :2522.52              Median : 7362.90
##  Mean   :2306.89              Mean   : 6710.18
##  3rd Qu.:2752.17              3rd Qu.: 8483.80
##  Max.   :3650.81              Max.   :13499.92
##
##  chr17_rs393152_GT chr17_rs12185268_GT chr17_rs199533_GT  UPDRS_part_I
##  Min.   :0.0000    Min.   :0.0000      Min.   :0.0000    Min.   : 0.000
##  1st Qu.:0.0000    1st Qu.:0.0000      1st Qu.:0.0000    1st Qu.: 0.000
##  Median :0.0000    Median :0.0000      Median :0.0000    Median : 1.000
##  Mean   :0.4468    Mean   :0.4268      Mean   :0.4052    Mean   : 1.306
##  3rd Qu.:1.0000    3rd Qu.:1.0000      3rd Qu.:1.0000    3rd Qu.: 2.000
##  Max.   :2.0000    Max.   :2.0000      Max.   :2.0000    Max.   :13.000
##
##    time_visit
##  Min.   : 0.00
##  1st Qu.: 9.00
##  Median :24.00
##  Mean   :23.83
##  3rd Qu.:36.00
##  Max.   :54.00
# randomly split data into training (80%) and test (20%) sets
set.seed(seed)
train = sample(1:nrow(X), round((4/5) * nrow(X)))
test = -train

# subset training data
yTrain = y[train]
XTrain = X[train, ]
XTrainOLS = cbind(rep(1, nrow(XTrain)), XTrain)

# subset test data
yTest = y[test]
XTest = X[test, ]

#### Model Estimation & Selection ####

# Estimate models
fitOLS = lm(yTrain ~ XTrain)  # Ordinary Least Squares

# glmnet automatically standardizes the predictors
fitRidge = glmnet(XTrain, yTrain, alpha = 0)  # Ridge Regression
fitLASSO = glmnet(XTrain, yTrain, alpha = 1)  # The LASSO

Readers  are  encouraged  to  compare  the  two  models,  ridge  and  LASSO. 


18.6.2  Computational  Complexity 

Recall that the regularized regression estimates depend on the regularization parameter
λ. Fortunately, efficient algorithms for choosing optimal λ parameters do exist.
Examples of solution path algorithms include:

•  LARS  Algorithm  for  the  LASSO  (Efron  et  al.  2004) 

•  Piecewise  linearity  (Rosset  and  Zhu  2007) 

•  Generic  path  algorithm  (Zhou  and  Wu  2013) 

•  Pathwise  coordinate  descent  (Friedman  et  al.  2007) 

•  Alternating  Direction  Method  of  Multipliers  (ADMM)  (Boyd  et  al.  2011) 

We will show how to visualize the relations between the regularization parameter
(ln(λ)) and the number and magnitude of the corresponding coefficients for each
specific regularized regression method.


18.6.3  LASSO  and  Ridge  Solution  Paths 

Figures 18.6 and 18.7 show plots of the LASSO and Ridge results, respectively, and are
obtained using the R scripts below. Note that the top-horizontal axis labels indicate the
number of non-trivial parameters in the resulting model corresponding to the log(λ), which
is labeled on the bottom-horizontal axis.
Fig. 18.6 Relations between LASSO-regularized model coefficient sizes (y-axis), magnitude of the
regularization parameter, Log Lambda (bottom axis), and the efficacy of the model selection, i.e.,
number of non-trivial coefficients (top axis, "LASSO regularizer: Number of Nonzero (Active)
Coefficients": 27, 22, 15, 5, 0)

Fig. 18.7 Relations between Ridge-regularized model coefficient sizes (y-axis), magnitude of the
regularization parameter (bottom axis), and the efficacy of the model selection, i.e., number of
non-trivial coefficients (top axis, "Ridge regularizer: Number of Nonzero (Active)
Coefficients": 27, 27, 27, 27, 27)
### Plot Solution Path ###

# LASSO
plot(fitLASSO, xvar = "lambda", label = TRUE)

# add label to upper x-axis
mtext("LASSO regularizer: Number of Nonzero (Active) Coefficients",
      side = 3, line = 2.5)

Similarly,  the  plot  for  the  Ridge  regularization  can  be  obtained  by: 

### Plot Solution Path ###

# Ridge
plot(fitRidge, xvar = "lambda", label = TRUE)

# add label to upper x-axis
mtext("Ridge regularizer: Number of Nonzero (Active) Coefficients",
      side = 3, line = 2.5)

Let’s try to compare the paths of the LASSO and Ridge regression solutions.
Below, you will see that the curves of LASSO are steeper and non-differentiable at
some points, which is the result of using the $L_1$ norm. On the other hand, the Ridge
path is smoother and asymptotically tends to 0 as λ increases.

Let’s start by examining the joint objective function (including LASSO and
Ridge terms):

$$\min_{\beta}\left(\sum_{i}(y_i - x_i\beta)^2 + \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right),$$

where $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$ and $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}$ are the norms of β corresponding to
the $L_1$ and $L_2$ distance measures, respectively. The cases α = 0 and α = 1 correspond to
Ridge and LASSO regularization, respectively. The following two natural questions arise:


•  What if 0 < α < 1?

•  How  does  the  regularization  penalty  term  affect  the  optimal  solution? 


In Chap. 10, we explored the minimal SSE (Sum of Square Error) for the OLS
(without penalty) where the feasible parameter (β) spans the entire real solution
space. In penalized optimization problems, the best solution may actually be
unachievable. Therefore, we look for solutions that are “closest”, within the feasible
region, to the enigmatic best solution.

The effect of the penalty term on the objective function is separate from the
fidelity term (OLS solution). Thus, the effect of 0 < α < 1 is limited to the size and
shape of the penalty region. Let’s try to visualize the feasible region as:


•  centrosymmetric, when α = 0, and

•  super diamond, when α = 1.


Here  is  a  hands-on  example: 
require(needs)

# Constructing Quadratic Formula
result <- function(a, b, c){
  if(delta(a, b, c) > 0){  # first case D>0
    x_1 = (-b + sqrt(delta(a, b, c)))/(2*a)
    x_2 = (-b - sqrt(delta(a, b, c)))/(2*a)
    result = c(x_1, x_2)
  }
  else if(delta(a, b, c) == 0){  # second case D=0
    x = -b/(2*a)
  }
  else {"There are no real roots."}  # third case D<0
}

# Constructing delta (the discriminant)
delta <- function(a, b, c){
  b^2 - 4*a*c
}
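
One plausible use of such a quadratic solver is to trace the boundary of the blended penalty region. The sketch below is self-contained and makes its assumptions explicit: it takes the boundary to be the set of points where $\frac{1-\alpha}{2}(\beta_1^2+\beta_2^2) + \alpha(|\beta_1|+|\beta_2|) = t$, restricts attention to the first quadrant, and requires α < 1 so that the equation in β₂ is genuinely quadratic. The function names and the values of α and t are hypothetical.

```r
# Blended (elastic net type) penalty for two coefficients
enet_penalty <- function(b1, b2, alpha) {
  (1 - alpha) / 2 * (b1^2 + b2^2) + alpha * (abs(b1) + abs(b2))
}

# For a given b1 >= 0, solve the quadratic in b2 >= 0 that puts (b1, b2)
# on the boundary enet_penalty(b1, b2, alpha) = t (requires alpha < 1)
boundary_b2 <- function(b1, alpha, t) {
  a    <- (1 - alpha) / 2
  b    <- alpha
  c0   <- (1 - alpha) / 2 * b1^2 + alpha * abs(b1) - t
  disc <- b^2 - 4 * a * c0            # discriminant
  (-b + sqrt(disc)) / (2 * a)         # larger (non-negative) root
}

alpha <- 0.5
t     <- 0.123
b1    <- seq(0, 0.2, length.out = 5)
b2    <- boundary_b2(b1, alpha, t)

print(cbind(beta1 = b1, beta2 = b2, penalty = enet_penalty(b1, b2, alpha)))
```

Reflecting the computed points across both axes gives the full boundary, which is symmetric about the origin.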


To  make  this  realistic,  we  will  use  the  MLB  dataset  to  first  fit  an  OLS  model.  The 
dataset  contains  1,034  records  of  heights  and  weights  for  some  current  and  recent 
Major  League  Baseball  (MLB)  Players. 

•  Height :  Player  height  in  inches, 

•  Weight :  Player  weight  in  pounds, 

•  Age :  Player  age  at  time  of  record. 

Then, we can obtain the SSE for any ‖β‖:

$$SSE = \|Y - \hat{Y}\|^2 = (Y - \hat{Y})^T(Y - \hat{Y}) = Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta.$$
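
The expansion can be checked numerically. The following short base-R sketch uses simulated, standardized data and an arbitrary (hypothetical) coefficient vector, and compares the directly computed SSE with the expanded quadratic form.

```r
set.seed(5)
n    <- 50
X    <- scale(matrix(rnorm(n * 2), n, 2))   # two standardized predictors
Y    <- scale(rnorm(n))                     # standardized outcome (n x 1 matrix)
beta <- c(0.3, -0.2)                        # arbitrary candidate coefficients

# Direct computation: squared norm of the residual vector
sse_direct <- sum((Y - X %*% beta)^2)

# Expanded form: Y'Y - 2 beta' X'Y + beta' X'X beta
sse_expanded <- drop(t(Y) %*% Y - 2 * t(beta) %*% t(X) %*% Y +
                     t(beta) %*% t(X) %*% X %*% beta)

print(c(direct = sse_direct, expanded = sse_expanded))
```

Because the expanded form is quadratic in β, evaluating it over a grid of (β₁, β₂) values produces elliptical contours centered at the OLS estimate.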


Next,  we  will  compute  the  SSE  contours  in  several  situations. 


library("ggplot2")

# load data
mlb <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',
                  as.is=T, header=T)
str(mlb)

## 'data.frame':    1034 obs. of  6 variables:
##  $ Name    : chr  "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
##  $ Team    : chr  "BAL" "BAL" "BAL" "BAL" ...
##  $ Position: chr  "Catcher" "Catcher" "Catcher" "First_Baseman" ...
##  $ Height  : int  74 74 72 72 73 69 69 71 76 71 ...
##  $ Weight  : int  180 215 210 210 188 176 209 200 231 180 ...
##  $ Age     : num  23 34.7 30.8 35.4 35.7 ...

fit <- lm(Height ~ Weight + Age - 1, data = as.data.frame(scale(mlb[, 4:6])))
points = data.frame(x = c(0, fit$coefficients[1]), y = c(0, fit$coefficients[2]),
                    z = c("(0,0)", "OLS Coef"))

Y = scale(mlb$Height)
X = scale(mlb[, c(5, 6)])

beta1 = seq(-0.556, 1.556, length.out = 100)
beta2 = seq(-0.661, 0.3386, length.out = 100)
df <- expand.grid(beta1 = beta1, beta2 = beta2)
b = as.matrix(df)
df$sse <- rep(t(Y)%*%Y, 100*100) - 2*b%*%t(X)%*%Y + diag(b%*%t(X)%*%X%*%t(b))
base <- ggplot(df) +
  stat_contour(aes(beta1, beta2, z = sse),
               breaks = round(quantile(df$sse, seq(0, 0.2, 0.03)), 0),
               size = 0.5, color = "darkorchid2", alpha = 0.8) +
  scale_x_continuous(limits = c(-0.4, 1)) +
  scale_y_continuous(limits = c(-0.55, 0.4)) +
  coord_fixed(ratio = 1) +
  geom_point(data = points, aes(x, y)) +
  geom_text(data = points, aes(x, y, label = z), vjust = 2, size = 3.5) +
  geom_segment(aes(x = -0.4, y = 0, xend = 1, yend = 0), colour = "grey46",
               arrow = arrow(length = unit(0.30, "cm")), size = 0.5, alpha = 0.8) +
  geom_segment(aes(x = 0, y = -0.55, xend = 0, yend = 0.4), colour = "grey46",
               arrow = arrow(length = unit(0.30, "cm")), size = 0.5, alpha = 0.8)

plot_alpha = function(alpha = 0, restrict = 0.2, beta1_range = 0.2,
                      annot = c(0.15, -0.25, 0.205, -0.05)){
  a = alpha; t = restrict; k = beta1_range; pos = data.frame(V1 = annot[1:4])
  tex = paste("(", as.character(annot[3]), ",", as.character(annot[4]), ")",
              sep = "")
  K = seq(0, k, length.out = 50)
  y = unlist(lapply((1-a)*K^2/2 + a*K - t, result, a = (1-a)/2, b = a))[seq(1, 99, by = 2)]
  fills = data.frame(x = c(rev(-K), K), y1 = c(rev(y), y), y2 = c(-rev(y), -y))
  p <- base + geom_line(data = fills, aes(x = x, y = y1), colour = "salmon1",
                        alpha = 0.6, size = 0.7) +
    geom_line(data = fills, aes(x = x, y = y2), colour = "salmon1", alpha = 0.6,
              size = 0.7) +
    geom_polygon(data = fills, aes(x, y1), fill = "red", alpha = 0.2) +
    geom_polygon(data = fills, aes(x, y2), fill = "red", alpha = 0.2) +
    geom_segment(data = pos, aes(x = V1[1], y = V1[2], xend = V1[3],
                                 yend = V1[4]),
                 arrow = arrow(length = unit(0.30, "cm")), alpha = 0.8,
                 colour = "magenta") +
    ggplot2::annotate("text", x = pos$V1[1] - 0.01, y = pos$V1[2] - 0.11,
                      label = paste(tex, "\n", "Point of Contact \n i.e.,
Coef of", "alpha=", fractions(a)), size = 3) +
    xlab(expression(beta[1])) +
    ylab(expression(beta[2])) +
    ggtitle(paste("alpha =", as.character(fractions(a)))) +
    theme(legend.position = "none")
}

# $\alpha=0$ - Ridge
p1 <- plot_alpha(alpha = 0, restrict = (0.21^2)/2, beta1_range = 0.21,
                 annot = c(0.15, -0.25, 0.205, -0.05))
p1 <- p1 + ggtitle(expression(paste(alpha, "=0 (Ridge)")))

# $\alpha=1/9$
p2 <- plot_alpha(alpha = 1/9, restrict = 0.046, beta1_range = 0.22,
                 annot = c(0.15, -0.25, 0.212, -0.02))
p2 <- p2 + ggtitle(expression(paste(alpha, "=1/9")))

# $\alpha=1/5$
p3 <- plot_alpha(alpha = 1/5, restrict = 0.063, beta1_range = 0.22,
                 annot = c(0.13, -0.25, 0.22, 0))
p3 <- p3 + ggtitle(expression(paste(alpha, "=1/5")))

# $\alpha=1/2$
p4 <- plot_alpha(alpha = 1/2, restrict = 0.123, beta1_range = 0.22,
                 annot = c(0.12, -0.25, 0.22, 0))
p4 <- p4 + ggtitle(expression(paste(alpha, "=1/2")))

# $\alpha=3/4$
p5 <- plot_alpha(alpha = 3/4, restrict = 0.17, beta1_range = 0.22,
                 annot = c(0.12, -0.25, 0.22, 0))
p5 <- p5 + ggtitle(expression(paste(alpha, "=3/4")))

# $\alpha=1$ - LASSO
t = 0.22
K = seq(0, t, length.out = 50)
fills = data.frame(x = c(-rev(K), K), y1 = c(rev(t-K), c(t-K)),
                   y2 = c(-rev(t-K), -c(t-K)))
p6 <- base +
  geom_segment(aes(x = 0, y = t, xend = t, yend = 0), colour = "salmon1",
               alpha = 0.1, size = 0.2) +
  geom_segment(aes(x = 0, y = t, xend = -t, yend = 0), colour = "salmon1",
               alpha = 0.1, size = 0.2) +
  geom_segment(aes(x = 0, y = -t, xend = t, yend = 0), colour = "salmon1",
               alpha = 0.1, size = 0.2) +
  geom_segment(aes(x = 0, y = -t, xend = -t, yend = 0), colour = "salmon1",
               alpha = 0.1, size = 0.2) +
  geom_polygon(data = fills, aes(x, y1), fill = "red", alpha = 0.2) +
  geom_polygon(data = fills, aes(x, y2), fill = "red", alpha = 0.2) +
  geom_segment(aes(x = 0.12, y = -0.25, xend = 0.22, yend = 0), colour = "magenta",
               arrow = arrow(length = unit(0.30, "cm")), alpha = 0.8) +
  ggplot2::annotate("text", x = 0.11, y = -0.36,
                    label = "(0.22,0)\n Point of Contact \n i.e Coef of LASSO", size = 3) +
  xlab(expression(beta[1])) +
  ylab(expression(beta[2])) +
  theme(legend.position = "none") +
  ggtitle(expression(paste(alpha, "=1 (LASSO)")))


Then, let’s add the six feasible regions corresponding to $\alpha = 0$ (Ridge), $\alpha = \frac{1}{9}$, $\alpha = \frac{1}{5}$, $\alpha = \frac{1}{2}$, $\alpha = \frac{3}{4}$ and $\alpha = 1$ (LASSO).

Figures 18.8, 18.9 and 18.10 provide some intuition into the continuum from Ridge to LASSO regularization. The SSE contours are drawn as ellipses (purple) and the feasible regions are shaded in red. The curves around the feasible regions represent the boundary of the constraint function $\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 \le t$.

In this example, $\beta_2$ shrinks to 0 for $\alpha = \frac{1}{5}$, $\alpha = \frac{1}{2}$, $\alpha = \frac{3}{4}$ and $\alpha = 1$.


We observe that, in the Ridge case, it is almost impossible for the SSE contours to touch the circular feasible region exactly on one of the coordinate axes, so Ridge coefficients are rarely shrunk all the way to 0. This is also true in higher dimensions (nD), where the $L_1$ and $L_2$ metrics are unchanged and the 2D ellipse contours become hyper-ellipsoidal shapes.

Generally, as $\alpha$ goes from 0 to 1, the coefficients of more features tend to shrink towards 0. This specific property makes LASSO useful for variable selection.
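
The role of the mixing parameter can be made concrete by evaluating the constraint function directly. The following is a minimal sketch, assuming a hypothetical helper `enet_penalty()` (it is not part of glmnet or of the plotting code above); the two coefficient vectors are the contact points annotated in the Ridge and LASSO panels.

```r
# Hypothetical helper: elastic-net penalty
#   (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1
enet_penalty <- function(beta, alpha) {
  (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))
}

beta_ridge <- c(0.205, -0.05)  # contact point annotated in the alpha = 0 panel
beta_lasso <- c(0.22, 0)       # contact point annotated in the alpha = 1 panel

pen_ridge <- enet_penalty(beta_ridge, alpha = 0)  # half the squared L2 norm
pen_lasso <- enet_penalty(beta_lasso, alpha = 1)  # the L1 norm
c(ridge = pen_ridge, lasso = pen_lasso)
```

Both values should sit close to the corresponding `restrict` arguments passed to `plot_alpha()`, since the contact points lie (approximately) on the boundary of the feasible region.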

Let’s compare the feasibility regions corresponding to Ridge (top, p1) and LASSO (bottom, p6) regularization.


plot(p1)

Fig. 18.8 Ridge-regularization SSE contour and penalty region

plot(p6)

Fig. 18.9 LASSO-regularization SSE contour and penalty region

Then, we can plot the progression from Ridge to LASSO (Fig. 18.10). This composite plot is intense and may take several minutes to render. Finally, Fig. 18.11 depicts the MSE of the cross-validated LASSO-regularized model against the magnitude of the regularization parameter (bottom axis) and the number of non-trivial coefficients (top axis). The dashed vertical lines suggest an optimal range [3:9] for the number of features to include in the model.

library("gridExtra")
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)

Fig. 18.10 SSE contour and penalty region for six continuous values of the alpha parameter illustrating the smooth transition from Ridge (α = 0) to LASSO (α = 1) regularization

Fig. 18.11 MSE of the cross-validated LASSO-regularized model against the magnitude of the regularization parameter (bottom axis, log(Lambda)), and the efficacy of the model selection, i.e., number of non-trivial coefficients (top axis). The dashed vertical lines suggest an optimal range for the penalty term and the number of features


18.6.4  Choice  of  the  Regularization  Parameter 

Efficiently obtaining the entire solution path is nice, but we still have to choose a specific $\lambda$ regularization parameter. This is critical as $\lambda$ controls the bias-variance tradeoff. Traditional model selection methods rely on various metrics like Mallows’ $C_p$, AIC, BIC, and adjusted $R^2$.

Internal  statistical  validation  (Cross  validation)  is  a  popular  modern  alternative, 
which  offers  some  of  these  benefits: 

•  Choice  is  based  on  predictive  performance, 

•  Makes  fewer  model  assumptions, 

•  More  widely  applicable. 


18.6.5 Cross Validation Motivation

Ideally, we would like a separate validation set for choosing $\lambda$ for a given method. Reusing training sets may encourage overfitting and using testing data to pick $\lambda$ may underestimate the true error rate. Often, when we do not have enough data for a separate validation set, cross-validation provides an alternative strategy.


18.6.6 n-Fold Cross Validation

We  have  already  seen  examples  of  using  cross-validation,  e.g.,  Chap.  14,  and 
Chap.  21  provides  additional  details  about  this  internal  statistical  assessment  strategy. 

We  can  use  either  automated  or  manual  cross-validation.  In  either  case,  the 
protocol  involves  the  following  iterative  steps: 

1. Randomly split the training data into n parts (“folds”).

2. Fit a model using data in n − 1 folds for multiple values of $\lambda$.

3. Calculate some prediction quality metrics (e.g., MSE, accuracy) on the last remaining fold, see Chap. 14.

4. Repeat the process and average the prediction metrics across iterations.

Common choices of n are 5, 10, and the sample size (which corresponds to leave-one-out CV). The one-standard-error rule chooses the $\lambda$ corresponding to the smallest model with MSE within one standard error of the minimum MSE.
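
The protocol above, including the one-standard-error rule, can be sketched manually in base R. This is only an illustration under stated assumptions: the data are simulated, `ridge_fit()` is a hypothetical closed-form ridge helper (not the glmnet solver), and centering is done once on the full sample rather than within each fold.

```r
set.seed(1234)
n_obs <- 120; p <- 8
X <- scale(matrix(rnorm(n_obs * p), n_obs, p))       # simulated, standardized predictors
beta_true <- c(3, -2, rep(0, p - 2))
y <- as.numeric(X %*% beta_true + rnorm(n_obs))
y <- y - mean(y)                                     # center the response (no intercept)

# Hypothetical helper: closed-form ridge estimate (X'X + lambda I)^{-1} X'y
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))  # step 1: random folds
lambdas <- 10^seq(3, -2, length.out = 30)

# steps 2-3: for each lambda, fit on n-1 folds and score the held-out fold
cv_mse <- sapply(lambdas, function(lambda) {
  sapply(seq_len(n_folds), function(f) {
    test <- fold_id == f
    b <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda)
    mean((y[test] - X[test, , drop = FALSE] %*% b)^2)
  })
})                                                   # n_folds x length(lambdas) matrix

# step 4: average across folds, then apply the one-standard-error rule
cv_mean <- colMeans(cv_mse)
cv_se   <- apply(cv_mse, 2, sd) / sqrt(n_folds)
i_min   <- which.min(cv_mean)
lambda_min <- lambdas[i_min]
lambda_1se <- max(lambdas[cv_mean <= cv_mean[i_min] + cv_se[i_min]])
c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

By construction `lambda_1se` is never smaller than `lambda_min`, i.e., the rule trades a small loss in estimated MSE for a more heavily regularized model.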

Fig. 18.12 Ridge-regularization, similar to Fig. 18.11


18.6.7 LASSO 10-Fold Cross Validation

Now, let’s apply an internal statistical cross-validation to assess the quality of the LASSO and Ridge models, based on our Parkinson’s disease case-study. Recall our split of the PD data into training (yTrain, XTrain) and testing (yTest, XTest) sets (Fig. 18.12).


#### 10-fold cross validation ####
# LASSO
library("glmnet")
set.seed(seed)  # set seed
# (10-fold) cross validation for the LASSO
cvLASSO = cv.glmnet(XTrain, yTrain, alpha = 1)
plot(cvLASSO)
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

# Report MSE LASSO
predLASSO <- predict(cvLASSO, s = cvLASSO$lambda.1se, newx = XTest)
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO

## [1] 200.5609

#### 10-fold cross validation ####
# Ridge Regression
set.seed(seed)  # set seed
# (10-fold) cross validation for Ridge Regression
cvRidge = cv.glmnet(XTrain, yTrain, alpha = 0)
plot(cvRidge)
mtext("CV Ridge: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

# Report MSE Ridge
predRidge <- predict(cvRidge, s = cvRidge$lambda.1se, newx = XTest)
testMSE_Ridge <- mean((predRidge - yTest)^2); testMSE_Ridge

## [1] 195.7406

Note that the predict() method, applied to cv.glmnet or glmnet forecasting models, is effectively a function wrapper to predict.glmnet(). According to what you would like to get as a prediction output, you can use type="..." to specify one of the following types of prediction outputs:

•  type = "link", reports the linear predictors for “binomial”, “multinomial”, “poisson” or “cox” models; for “gaussian” models it gives the fitted values.

•  type = "response", reports the fitted probabilities for “binomial” or “multinomial”, fitted mean for “poisson” and the fitted relative-risk for “cox”; for “gaussian” type “response” is equivalent to type “link”.

•  type = "coefficients", reports the coefficients at the requested values for s. Note that for “binomial” models, results are returned only for the class corresponding to the second level of the factor response.

•  type = "class", applies only to “binomial” or “multinomial” models, and produces the class label corresponding to the maximum probability.

•  type = "nonzero", returns a list of the indices of the nonzero coefficients for each value of s.
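
The distinction between the "link" and "response" scales is not specific to glmnet. As a rough analogy only, the sketch below uses base R's `glm()` and `predict()` on simulated binary data (the data and variable names are hypothetical) to show that the response-scale predictions are the inverse-logit of the link-scale predictions.

```r
# Simulated binary outcome (hypothetical data, for illustration only)
set.seed(2024)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

new_x <- data.frame(x = c(-1, 0, 1))
eta  <- predict(fit, newdata = new_x, type = "link")      # linear predictor (log-odds)
prob <- predict(fit, newdata = new_x, type = "response")  # fitted probabilities

# the second and third columns agree: prob = plogis(eta)
cbind(eta = eta, prob = prob, plogis_eta = plogis(eta))
```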


18.6.8 Stepwise OLS (Ordinary Least Squares)


For  a  fair  comparison,  let’s  also  obtain  an  OLS  stepwise  model  selection,  see  Chap.  17. 


dt = as.data.frame(cbind(yTrain, XTrain))
ols_step <- lm(yTrain ~ ., data = dt)
ols_step <- step(ols_step, direction = 'both', k=2, trace = F)
summary(ols_step)

##
## Call:
## lm(formula = yTrain ~ L_cingulate_gyrus_ComputeArea + R_cingulate_gyrus_Volume +
##     L_caudate_Volume + L_putamen_ComputeArea + L_putamen_Volume +
##     R_putamen_ComputeArea + Weight + Age + chr17_rs11012_GT +
##     chr17_rs393152_GT + chr17_rs12185268_GT + UPDRS_part_I, data = dt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.990  -9.098  -0.310   8.373  49.027
##
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   -2.8179771  4.5458868  -0.620  0.53548
## L_cingulate_gyrus_ComputeArea  0.0045203  0.0013422   3.368  0.00079 ***
## R_cingulate_gyrus_Volume      -0.0010036  0.0003461  -2.900  0.00382 **
## L_caudate_Volume              -0.0021999  0.0011054  -1.990  0.04686 *
## L_putamen_ComputeArea         -0.0087295  0.0045925  -1.901  0.05764 .
## L_putamen_Volume               0.0035419  0.0017969   1.971  0.04902 *
## R_putamen_ComputeArea          0.0029862  0.0019036   1.569  0.11706
## Weight                         0.0424646  0.0268088   1.584  0.11355
## Age                            0.2198283  0.0522490   4.207 2.84e-05 ***
## chr17_rs11012_GT              -4.2408237  1.8122682  -2.340  0.01950 *
## chr17_rs393152_GT             -3.5818432  2.2619779  -1.584  0.11365
## chr17_rs12185268_GT            8.2990131  2.7356037   3.034  0.00248 **
## UPDRS_part_I                   3.8780897  0.2541024  15.262  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.41 on 911 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2457
## F-statistic: 26.06 on 12 and 911 DF,  p-value: < 2.2e-16

We use direction = 'both' to allow both forward and backward steps and keep the model with the best criterion value. The argument k sets the penalty per model parameter: k = 2 corresponds to the AIC, whereas k = log(n) corresponds to the BIC.
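
To see how the penalty multiplier changes the selection, here is a small sketch using the built-in `mtcars` data (an assumption made purely for illustration; it is not the PD case-study data). The object names `step_aic` and `step_bic` are hypothetical.

```r
# Full OLS model on a built-in dataset (illustration only)
full  <- lm(mpg ~ ., data = mtcars)
n_obs <- nrow(mtcars)

# k = 2 penalizes each parameter by 2 (AIC); k = log(n) uses the BIC penalty
step_aic <- step(full, direction = "both", k = 2,          trace = 0)
step_bic <- step(full, direction = "both", k = log(n_obs), trace = 0)

# number of predictors retained (excluding the intercept) under each criterion
c(AIC = length(coef(step_aic)) - 1, BIC = length(coef(step_bic)) - 1)
```

Because log(n) exceeds 2 whenever n is at least 8, the BIC penalty is the stricter of the two and tends to retain fewer predictors.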

Then, we use the ols_step model to predict the outcome Y for some new test data.

betaHatOLS_step = ols_step$coefficients
var_step <- colnames(ols_step$model)[-1]
XTestOLS_step = cbind(rep(1, nrow(XTest)), XTest[, var_step])
predOLS_step = XTestOLS_step %*% betaHatOLS_step
testMSEOLS_step = mean((predOLS_step - yTest)^2)

# Report MSE OLS Stepwise feature selection
testMSEOLS_step

##  [1]  186.3043 

Alternatively, we can predict the outcomes directly using the predict() function, and the results should be identical:

pred2 <- predict(ols_step, as.data.frame(XTest))
any(pred2 == predOLS_step)

##  [1]  TRUE 
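
Note that any() returns TRUE as soon as a single pair of values matches, so it is a weak check. A stricter comparison, sketched here on the built-in `mtcars` data (an illustrative assumption, not the PD data), uses all.equal(), which compares all elements up to numerical tolerance.

```r
# Fit a small OLS model (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# manual prediction: prepend an intercept column and multiply by the coefficients
X_new <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))
pred_manual  <- as.numeric(X_new %*% coef(fit))

# built-in prediction
pred_builtin <- as.numeric(predict(fit, newdata = mtcars))

# tolerance-based, element-wise comparison of the two prediction vectors
all.equal(pred_manual, pred_builtin)
```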

18.6.9  Final  Models 

Let’s  identify  the  most  important  (predictive)  features,  which  can  then  be  interpreted 
in  the  context  of  the  specific  data. 

# Determine final models
# Extract Coefficients
# OLS coefficient estimates
betaHatOLS = fitOLS$coefficients
# LASSO coefficient estimates
betaHatLASSO = as.double(coef(fitLASSO, s = cvLASSO$lambda.1se))  # s is lambda
# Ridge coefficient estimates
betaHatRidge = as.double(coef(fitRidge, s = cvRidge$lambda.1se))

# Test Set MSE
# calculate predicted values
XTestOLS = cbind(rep(1, nrow(XTest)), XTest)  # add intercept to test data
predOLS = XTestOLS %*% betaHatOLS
predLASSO = predict(fitLASSO, s = cvLASSO$lambda.1se, newx = XTest)
predRidge = predict(fitRidge, s = cvRidge$lambda.1se, newx = XTest)

# calculate test set MSE
testMSEOLS = mean((predOLS - yTest)^2)
testMSELASSO = mean((predLASSO - yTest)^2)
testMSERidge = mean((predRidge - yTest)^2)


Figure 18.13 shows a rank-ordered list of the key predictors of the clinical outcome variable (total UPDRS, y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III).

Fig. 18.13 Variables importance plot (regression coefficient estimates) for the three alternative models: OLS, LASSO, and Ridge


# Plot Regression Coefficients
# create variable names for plotting
library("arm")
par(mar=c(2, 13, 1, 1))  # extra large left margin
varNames <- colnames(data1[ , !(names(data1) %in% drop_features)])
varNames; length(varNames)

##  [1] "L_insular_cortex_ComputeArea"  "L_insular_cortex_Volume"
##  [3] "R_insular_cortex_ComputeArea"  "R_insular_cortex_Volume"
##  [5] "L_cingulate_gyrus_ComputeArea" "L_cingulate_gyrus_Volume"
##  [7] "R_cingulate_gyrus_ComputeArea" "R_cingulate_gyrus_Volume"
##  [9] "L_caudate_ComputeArea"         "L_caudate_Volume"
## [11] "R_caudate_ComputeArea"         "R_caudate_Volume"
## [13] "L_putamen_ComputeArea"         "L_putamen_Volume"
## [15] "R_putamen_ComputeArea"         "R_putamen_Volume"
## [17] "Sex"                           "Weight"
## [19] "Age"                           "chr12_rs34637584_GT"
## [21] "chr17_rs11868035_GT"           "chr17_rs11012_GT"
## [23] "chr17_rs393152_GT"             "chr17_rs12185268_GT"
## [25] "chr17_rs199533_GT"             "UPDRS_part_I"
## [27] "time_visit"

## [1] 27

# Graph 27 regression coefficients (exclude intercept [1], betaHat indices 2:27)
coefplot(betaHatOLS[2:27], sd = rep(0, 26), cex.pts = 5,
         main = "Regression Coefficient Estimates", varnames = varNames)
coefplot(betaHatLASSO[2:27], sd = rep(0, 26), add = TRUE, col.pts = "red",
         cex.pts = 5)
coefplot(betaHatRidge[2:27], sd = rep(0, 26), add = TRUE, col.pts = "blue",
         cex.pts = 5)
legend("bottomright", c("OLS", "LASSO", "Ridge"), col = c("black", "red", "blue"),
       pch = c(20, 20, 20), bty = "o")

Table 18.2 Testing data MSE

OLS        OLS_step   LASSO      Ridge
183.3239   186.3043   200.5609   195.7406


18.6.10  Model  Performance 

Table  18.2  quantifies  the  performance  of  the  four  models. 


# Test Set MSE Table
library("knitr")  # provides kable()
# create table as data frame
MSETable = data.frame(OLS=testMSEOLS, OLS_step=testMSEOLS_step,
                      LASSO=testMSELASSO, Ridge=testMSERidge)
# convert to markdown
kable(MSETable, format="pandoc", caption="Test Set MSE", align=c("c", "c", "c", "c"))

18.6.11  Comparing  Selected  Features 

var_step = names(ols_step$coefficients)[-1]
var_lasso = colnames(XTrain)[which(coef(fitLASSO, s = cvLASSO$lambda.min) != 0) - 1]
intersect(var_step, var_lasso)

## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"
## [3] "Weight"                        "Age"
## [5] "chr17_rs12185268_GT"           "UPDRS_part_I"

coef(fitLASSO, s = cvLASSO$lambda.min)

## 28 x 1 sparse Matrix of class "dgCMatrix"
##                                           1
## (Intercept)                    1.7142107049
## L_insular_cortex_ComputeArea   .
## L_insular_cortex_Volume        .
## R_insular_cortex_ComputeArea   .
## R_insular_cortex_Volume        .
## L_cingulate_gyrus_ComputeArea  0.0003399436
## L_cingulate_gyrus_Volume       0.0002099980
## R_cingulate_gyrus_ComputeArea  .
## R_cingulate_gyrus_Volume       .
## L_caudate_ComputeArea          .
## L_caudate_Volume               .
## R_caudate_ComputeArea          .
## R_caudate_Volume               .
## L_putamen_ComputeArea          .
## L_putamen_Volume               .
## R_putamen_ComputeArea          0.0010417502
## R_putamen_Volume               .
## Sex                            .
## Weight                         0.0336216322
## Age                            0.2097678904
## chr12_rs34637584_GT            .
## chr17_rs11868035_GT           -0.0094055047
## chr17_rs11012_GT               .
## chr17_rs393152_GT              .
## chr17_rs12185268_GT            0.2688574886
## chr17_rs199533_GT              0.3730813890
## UPDRS_part_I                   3.7697168303
## time_visit                     .

Stepwise variable selection for OLS selects 12 variables, whereas LASSO selects 9 variables with the best $\lambda$. There are 6 common variables identified as salient features by both OLS and LASSO.


18.6.12  Summary 

Traditional  linear  models  are  useful  but  also  have  their  shortcomings: 

•  Prediction  accuracy  may  be  sub-optimal. 

•  Model  interpretability  may  be  challenging  (especially  when  a  large  number  of 
features  are  used  as  regressors). 

•  Stepwise  model  selection  may  improve  the  model  performance  and  add  some 
interpretations,  but  still  may  not  be  optimal. 

Regularization  adds  a  penalty  term  to  the  estimation: 

•  Enables  exploitation  of  the  bias-variance  tradeoff. 

•  Provides  flexibility  on  specifying  penalties  to  allow  for  continuous  variable 
selection. 

•  Allows  incorporation  of  prior  knowledge. 


18.7  Knock-off  Filtering:  Simulated  Example 

Variable selection that controls the false discovery rate (FDR) of salient features can be accomplished in different ways. Knockoff filtering represents one strategy for controlled variable selection. To show the usage of knockoff.filter we start with a synthetic dataset constructed so that the true coefficient vector $\beta$ has only a few nonzero entries.

The essence of knockoff filtering is based on the following three-step process:

•  Construct the decoy features (knockoff variables), one for each real observed feature. These act as controls for assessing the importance of the real variables.

•  For each feature, $X_j$, compute the knockoff statistic, $W_j$, which measures the importance of the variable, relative to its decoy counterpart, $\tilde{X}_j$.

•  Determine the overall knockoff threshold. This is computed by rank-ordering the $W_j$ statistics (from large to small), walking down the list of $W_j$’s, selecting variables $X_j$ corresponding to positive $W_j$’s, and terminating this search the last time the ratio of negative to positive $W_j$’s is below the default FDR q value, e.g., q = 0.10.
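
The thresholding step can be sketched in a few lines of base R. This is a simplified illustration, not the knockoff package's implementation: `knockoff_threshold_sketch()` is a hypothetical helper, the W values are made up, and the `offset` argument (0 here; an offset of 1 gives a more conservative variant) is an assumption of this sketch.

```r
# Hypothetical helper: smallest threshold t among the |W_j| for which
# (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) does not exceed q
knockoff_threshold_sketch <- function(W, q = 0.10, offset = 0) {
  candidates <- sort(unique(abs(W[W != 0])))
  ratio <- sapply(candidates, function(t) {
    (offset + sum(W <= -t)) / max(1, sum(W >= t))
  })
  ok <- which(ratio <= q)
  if (length(ok) == 0) Inf else candidates[min(ok)]
}

# Made-up knockoff statistics: large positive values favor the real feature
W <- c(5.1, 4.3, -0.4, 3.8, 0.2, 2.9, -0.1, 3.3, 0.6, 2.2, 1.9, 4.7)

thr      <- knockoff_threshold_sketch(W, q = 0.10)
selected <- which(W >= thr)   # indices of the selected features
list(threshold = thr, selected = selected)
```

Negative statistics act as an estimate of how many false positives hide among the positive ones, which is why their ratio is compared with the target q.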

Mathematically, we consider $X_j$ to be unimportant (i.e., peripheral or extraneous) if the conditional distribution of $Y$ given $X_1, \ldots, X_p$ does not depend on $X_j$. Formally, $X_j$ is unimportant if it is conditionally independent of $Y$ given all other features, $X_{-j}$:

$$Y \perp X_j \mid X_{-j}.$$

We want to generate a Markov Blanket of $Y$, such that the smallest set of features $J$ satisfies this condition. Further, to make sure we do not make too many mistakes, we search for a set $\hat{S}$ controlling the false discovery rate (FDR):

$$FDR(\hat{S}) = E\left(\frac{\#\{j \in \hat{S} : X_j \text{ unimportant}\}}{\#\{j \in \hat{S}\}}\right) \le q \quad \text{(e.g., 10\%)}.$$


Let’s  look  at  one  simulation  example. 


# Problem parameters
n = 1000          # number of observations
p = 300           # number of variables
k = 30            # number of variables with nonzero coefficients
amplitude = 3.5   # signal amplitude (for noise level = 1)

# Problem data
X = matrix(rnorm(n*p), nrow=n, ncol=p)
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero)
y.sample <- function() X %*% beta + rnorm(n)

To begin with, we will invoke the knockoff.filter using the default settings.

# install.packages("knockoff")
library(knockoff)
y = y.sample()
result = knockoff.filter(X, y)
print(result)

## Call:
## knockoff.filter(X = X, y = y)
##
## Selected variables:
##  [1]   6  29  30  42  52  54  63  68  70  83  88  96 102 113 115 135 138
## [18] 139 167 176 179 194 212 220 225 228 241 248 265 273 287 288 295

The false discovery proportion (fdp) is:

fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(result$selected)

## [1] 0.09090909


This observed false discovery proportion (3 of the 33 selected variables, about 0.09) is consistent with the default target FDR of 0.10.

The default settings of the knockoff filter use a test statistic based on LASSO, knockoff.stat.lasso_signed_max, which computes the $W_j$ statistics that quantify the discrepancy between a real ($X_j$) and a decoy, knockoff ($\tilde{X}_j$), feature:

$$W_j = \max(X_j, \tilde{X}_j) \times \mathrm{sgn}(X_j - \tilde{X}_j).$$

Effectively, the $W_j$ statistics measure how much more important the variable $X_j$ is relative to its decoy counterpart $\tilde{X}_j$. The strength of the importance of $X_j$ relative to $\tilde{X}_j$ is measured by the magnitude of $W_j$.
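
As a small numerical illustration of the signed-max construction, the sketch below assumes we already have one importance score per real feature (`Z`) and per knockoff copy (`Z_ko`), e.g., the penalty value at which each enters the LASSO path. Both vectors are made up for illustration and the names are hypothetical.

```r
# Hypothetical importance scores (for illustration only)
Z    <- c(2.4, 0.0, 1.1, 0.3, 0.0)   # real features
Z_ko <- c(0.5, 0.7, 1.1, 0.0, 0.0)   # corresponding knockoff copies

# signed-max statistic: magnitude is the larger score,
# sign is positive when the real feature beats its knockoff
W <- pmax(Z, Z_ko) * sign(Z - Z_ko)
W
```

A tie between a feature and its knockoff yields W = 0, so that feature carries no evidence in either direction.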

The knockoff package includes several other test statistics, with names prefixed by knockoff.stat. For instance, we can use a statistic based on forward selection (fs) with a target FDR of 0.10.

result = knockoff.filter(X, y, fdr = 0.10, statistic = knockoff.stat.fs)
fdp(result$selected)

## [1] 0.1428571

One can also define additional test statistics, complementing the ones included in the package already. For instance, if we want to implement the following test statistic:

$$W_j = |X_j^{T} y| - |\tilde{X}_j^{T} y|,$$

we can code it as:

new_knockoff_stat <- function(X, X_ko, y) {
  abs(t(X) %*% y) - abs(t(X_ko) %*% y)
}
result = knockoff.filter(X, y, statistic = new_knockoff_stat)
fdp(result$selected)

## [1] 0.3333333

18.7.1  Notes 

The knockoff.filter function is a wrapper around several simpler functions that (1) construct knockoff variables (knockoff.create); (2) compute the test statistic W (various functions with prefix knockoff.stat); and (3) compute the threshold for variable selection (knockoff.threshold).

The high-level function knockoff.filter will automatically normalize the columns of the input matrix (unless this behavior is explicitly disabled). However, all other functions in this package assume that the columns of the input matrix have unit Euclidean norm.
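
When calling the lower-level functions directly, the columns can be rescaled by hand. The following base-R sketch shows unit-norm scaling only, on a made-up matrix `X_raw` (a hypothetical name); whether the package's own normalization performs additional steps is not assumed here.

```r
set.seed(1)
X_raw <- matrix(rnorm(50 * 4, mean = 2, sd = 3), nrow = 50, ncol = 4)

# Euclidean (L2) norm of each column
col_norms <- sqrt(colSums(X_raw^2))

# divide every column by its norm
X_unit <- sweep(X_raw, 2, col_norms, "/")

# each column now has squared norm equal to 1 (up to rounding)
colSums(X_unit^2)
```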


18.8  PD  Neuroimaging-Genetics  Case-Study 

Let’s  illustrate  controlled  variable  selection  via  knockoff  filtering  using  the  real  PD 
dataset. 

The goal is to determine which imaging, genetics and phenotypic covariates are associated with the clinical diagnosis of PD. The dataset is publicly available online.


18.8.1  Fetching,  Cleaning  and  Preparing  the  Data 

The  data  set  consists  of  clinical,  genetics,  and  demographic  measurements.  To 
evaluate  our  results,  we  will  compare  diagnostic  predictions  created  by  the  model 
for  the  UPDRS  scores  and  the  ResearchGroup  factor  variable.  First,  we  download 
the  data  and  read  it  into  data  frames. 


data1 <- read.table('https://umich.instructure.com/files/330397/download?download_frd=1', sep=",", header=T)
# we will deal with missing values using multiple imputation later. For now,
# let's just ignore incomplete cases
data1.completeRowIndexes <- complete.cases(data1)
# table(data1.completeRowIndexes)
prop.table(table(data1.completeRowIndexes))

## data1.completeRowIndexes
##     FALSE      TRUE
## 0.3452381 0.6547619

# attach(data1)
# View(data1[data1.completeRowIndexes, ])
data2 <- data1[data1.completeRowIndexes, ]
Dx_label <- data2$ResearchGroup; table(Dx_label)

## Dx_label
## Control      PD   SWEDD
##     121     897     137

We now construct the design matrix X and the response vector Y. The features (columns of X) represent covariates that will be used to explain the response Y.

# Construct preliminary design matrix.
# Define response and predictors
Y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III
table(Y)  # Show Clinically relevant classification

## Y
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## 54 20 25 12  8  7 11 16 16  9 21 16 13 13 22 25 21 31 25 29 29 28 20 25 28
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 26 35 41 23 34 32 31 37 34 28 36 29 27 22 19 17 18 18 19 16  9 10 12  9 11
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 66 68 69 71 75 80 81 82
##  7 10 11  5  7  4  1  5  9  4  3  2  1  6  1  2  1  2  1  1  2  3  1

Y <- Y[data1.completeRowIndexes]

# X = scale(ncaaData[, -20])  # Explicit Scaling is not needed, as
# glmnet auto standardizes predictors
# X = as.matrix(data1[, c("R_caudate_Volume", "R_putamen_Volume", "Weight",
# "Age", "chr17_rs12185268_GT")])  # X needs to be a matrix, not a data frame
drop_features <- c("FID_IID", "ResearchGroup", "UPDRS_part_I",
                   "UPDRS_part_II", "UPDRS_part_III")
X <- data1[ , !(names(data1) %in% drop_features)]
X = as.matrix(X)  # remove columns: index, ResearchGroup, and
                  # y=(UPDRS_part_I + UPDRS_part_II + UPDRS_part_III)
X <- X[data1.completeRowIndexes, ]; dim(X)

## [1] 1155   26

summary(X)

##  L_insular_cortex_ComputeArea L_insular_cortex_Volume
##  Min.   :  50.03              Min.   :   22.63
##  1st Qu.:2174.57              1st Qu.: 5867.23
##  Median :2522.52              Median : 7362.90
##  Mean   :2306.89              Mean   : 6710.18
##  3rd Qu.:2752.17              3rd Qu.: 8483.80
##  Max.   :3650.81              Max.   :13499.92
##
##  chr17_rs393152_GT chr17_rs12185268_GT chr17_rs199533_GT   time_visit
##  Min.   :0.0000    Min.   :0.0000      Min.   :0.0000    Min.   : 0.00
##  1st Qu.:0.0000    1st Qu.:0.0000      1st Qu.:0.0000    1st Qu.: 9.00
##  Median :0.0000    Median :0.0000      Median :0.0000    Median :24.00
##  Mean   :0.4468    Mean   :0.4268      Mean   :0.4052    Mean   :23.83
##  3rd Qu.:1.0000    3rd Qu.:1.0000      3rd Qu.:1.0000    3rd Qu.:36.00
##  Max.   :2.0000    Max.   :2.0000      Max.   :2.0000    Max.   :54.00

mode(X) <- 'numeric'

Dx_label <- Dx_label[data1.completeRowIndexes]; length(Dx_label)
## [1] 1155


18.8.2  Preparing  the  Response  Vector 

The knockoff filter is designed to control the FDR under Gaussian noise. A quick inspection of the response vector shows that it is highly non-Gaussian (Figs. 18.14 and 18.15).



hist(Y, breaks='FD')

Fig. 18.14 Histogram of the outcome clinical diagnostic variable (Y) for the Parkinson’s disease case-study

A log-transform may help to stabilize the clinical response measurements:

hist(log(Y), breaks='FD')

Fig. 18.15 Log-transformed histogram of the outcome clinical diagnostic variable (Y)

Fig. 18.16 Logistic curve transforming a continuous variable into a probability value

For binary outcome variables, or ordinal categorical variables, we can employ the logistic curve to transform the polytomous outcomes into real values (Fig. 18.16).

The Logistic curve is $y = f(x) = \frac{1}{1+e^{-x}}$, where y and x represent probability and quantitative-predictor values, respectively. A slightly more general form is: $y = f(x) = \frac{K}{1+e^{-x}}$, where the covariate $x \in (-\infty, \infty)$ and the response $y \in [0, K]$. For example,

library("ggplot2")
k = 7
x <- seq(-10, 10, 1)
plot(x, k/(1+exp(-x)), xlab="X-axis (Covariate)", ylab="Outcome k/(1+exp(-x)), k=7", type="l")


The point of this logistic transformation is that:

$$y = \frac{1}{1+e^{-x}} \iff x = \ln\frac{y}{1-y},$$

which represents the log-odds (where $y$ is the probability of an event of interest).

We use the logistic regression equation model to estimate the probability of
specific outcomes:

$$\text{(Estimate of) } P(Y = 1 \mid x_1, x_2, \ldots, x_l) = \frac{1}{1+e^{-\left(a_0+\sum_{k=1}^{l} a_k x_k\right)}},$$

where the coefficients $a_0$ (intercept) and effects $a_k$, $k = 1, 2, \ldots, l$, are estimated
using GLM according to a maximum likelihood approach. Using this model
allows us to estimate the probability of the dependent (clinical outcome) variable
$Y = 1$ (CO), i.e., surviving surgery, given the observed values of the predictors $X_k$,
$k = 1, 2, \ldots, l$.
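The relationship between the logistic curve and the log-odds can be checked numerically. The following is a minimal base-R sketch; the helper names `logistic()` and `logit()` are illustrative and do not come from the text.

```r
# Sketch: the logistic map and its inverse (the logit), using base R only.
logistic <- function(x, K = 1) K / (1 + exp(-x))   # K = 1 yields a probability
logit    <- function(y) log(y / (1 - y))           # log-odds of a probability y

x <- seq(-10, 10, 1)
y <- logistic(x)

# logit() undoes logistic(): the original covariate values are recovered
max_error <- max(abs(logit(y) - x))

# base R ships the same pair of functions as plogis() and qlogis()
same_as_base <- isTRUE(all.equal(y, plogis(x)))
```

Here `max_error` should be numerically negligible, which is just the statement that the log-odds of $y$ returns $x$.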
Fig. 18.17 Estimate of the logistic function for the clinical outcome (CO) probability based on the
surgeon’s experience (SE)


Table 18.3 Survival outcomes of a hypothetical surgical transplant treatment based on surgeon’s
experience

Surgeon’s experience (SE)    Clinical outcome (CO)
 1                           0
 1.5                         0
 2                           0
 2.5                         0
 3                           0
 3.5                         0
 3.5                         1
 4                           0
 4.5                         1
 5                           0
 5.5                         1
 6                           0
 6.5                         1
 7                           0
 8                           1
 8.5                         1
 9                           1
 9.5                         1
10                           1
10                           1

Probability of surviving a heart transplant based on surgeon’s experience. A
group of 20 patients undergo heart transplantation with different surgeons having
experience in the range from 0 (least) to 10 (most), representing 100’s of operating/
surgery hours. How does the surgeon’s experience affect the probability of
patient survival?

The data below show the outcome of a surgery (1 = survival, 0 = death)
according to the surgeons’ experience in 100’s of hours of practice (Fig. 18.17 and
Table 18.3).
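For readers without access to the course file, the same kind of model can be sketched from the 20 (SE, CO) pairs printed in Table 18.3. This is only an illustration: the object names `table_data` and `table_fit` are hypothetical, and it assumes the table reproduces the downloaded dataset.

```r
# Sketch: logistic regression fitted to the values transcribed from Table 18.3.
SE <- c(1, 1.5, 2, 2.5, 3, 3.5, 3.5, 4, 4.5, 5,
        5.5, 6, 6.5, 7, 8, 8.5, 9, 9.5, 10, 10)
CO <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
        1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
table_data <- data.frame(SE = SE, CO = CO)

# CO is binary, so a binomial GLM with the default logit link is appropriate
table_fit <- glm(CO ~ SE, data = table_data, family = "binomial")
coef(table_fit)            # intercept and SE slope on the log-odds scale
table_fit$null.deviance    # deviance of the intercept-only model
```

With 10 survivors among 20 patients, the intercept-only model predicts a probability of 0.5 for everyone, so its deviance is $-2 \times 20 \ln(0.5) \approx 27.726$.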

mydata <- read.csv("https://umich.instructure.com/files/405273/download?download_frd=1")  # 01_HeartSurgerySurvivalData.csv

# estimates a logistic regression model for the clinical outcome (CO),
# survival, using the glm (generalized linear model) function.

# convert Surgeon's Experience (SE) to a factor to indicate it should be
# treated as a categorical variable.
# mydata$rank <- factor(mydata$SE)

mylogit <- glm(CO ~ SE, data = mydata, family = "binomial")

# library(ggplot2)
ggplot(mydata, aes(x=SE, y=CO)) + geom_point() +
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)




Graph  of  a  logistic  regression  curve  showing  probability  of  surviving  the  surgery 
versus  surgeon’s  experience,  Fig.  18.17. 

The  graph  shows  the  probability  of  the  clinical  outcome,  survival,  (Y-axis)  versus 
the  surgeon’s  experience  (X-axis),  with  the  logistic  regression  curve  fitted  to 
the  data. 

mylogit <- glm(CO ~ SE, data = mydata, family = "binomial")
summary(mylogit)

## Call:
## glm(formula = CO ~ SE, family = "binomial", data = mydata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.7131  -0.5719  -0.0085   0.4493   1.8220
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -4.1030     1.7629  -2.327   0.0199 *
## SE            0.7583     0.3139   2.416   0.0157 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.092  on 18  degrees of freedom
## AIC: 20.092
##
## Number of Fisher Scoring iterations: 5

The  output  indicates  that  surgeon’s  experience  (SE)  is  significantly  associated 
with  the  probability  of  surviving  the  surgery  (0.0157,  Wald  test).  The  output  also 
provides  the  coefficients  for: 

• Intercept = -4.1030, and

• SE = 0.7583.

These coefficients can then be used in the logistic regression equation model to
estimate the probability of surviving the heart surgery:

Probability of surviving heart surgery: $CO = \frac{1}{1+\exp(-(-4.1030+0.7583 \times SE))}$.

For  example,  for  a  patient  who  is  operated  on  by  a  surgeon  with  200  h  of 
operating  experience  (SE  =  2),  we  plug  in  the  value  2  in  the  equation  to  get  an 
estimated  probability  of  survival,  p  =  0.07: 


SE=2
CO = 1/(1+exp(-(-4.1030+0.7583*SE)))
CO

## [1] 0.07001884
Similarly, for a patient undergoing heart surgery with a doctor who has
400 h of operating experience (SE = 4), the estimated probability of survival is
p = 0.26:

SE=4; CO = 1/(1+exp(-(-4.1030+0.7583*SE))); CO
## [1] 0.2554411

for (SE in c(1:5)) {
  CO <- 1/(1+exp(-(-4.1030+0.7583*SE)))
  print(c(SE, CO))
}

## [1] 1.00000000 0.03406915
## [1] 2.00000000 0.07001884
## [1] 3.0000000 0.1384648
## [1] 4.0000000 0.2554411
## [1] 5.0000000 0.4227486
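The same survival probabilities can be obtained without a loop. The sketch below plugs the reported coefficient estimates into base R's `plogis()`; the function name `survival_prob()` is illustrative only.

```r
# Sketch: vectorized evaluation of the fitted logistic curve.
a0 <- -4.1030   # intercept reported in the summary() output
a1 <-  0.7583   # SE effect reported in the summary() output

# plogis(z) computes 1 / (1 + exp(-z))
survival_prob <- function(SE) plogis(a0 + a1 * SE)

probs <- survival_prob(1:5)
round(probs, 3)
```

When the fitted model object is available, `predict(mylogit, newdata = data.frame(SE = 1:5), type = "response")` gives the analogous estimates using the unrounded coefficients.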

The table below shows the probability of surviving surgery for several values of
surgeons’ experience (Table 18.4).

The  output  from  the  logistic  regression  analysis  gives  a  p-value  of  p  =  0.0157, 
which  is  based  on  the  Wald  z-score.  In  addition  to  the  Wald  method,  we  can 
calculate  the  p-value  for  logistic  regression  using  the  Likelihood  Ratio  Test  (LRT), 
which  for  these  data  yields  0.0006476922  (Table  18.5). 


Table 18.4 Estimates of the likelihood of transplant surgery patient survival based on SE

Surgeon’s experience (SE)    Probability of patient survival (Clinical outcome)
1                            0.034
2                            0.07
3                            0.14
4                            0.26
5                            0.423

Table 18.5 Estimates of the effect-size, standard error and p-value quantifying the significance of
SE on CO

        Estimate    Std. error    z value    Pr(>|z|) Wald
SE      0.7583      0.3139        2.416      0.0157 *
mylogit <- glm(CO ~ SE, data = mydata, family = "binomial")
summary(mylogit)

## Call:
## glm(formula = CO ~ SE, family = "binomial", data = mydata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.7131  -0.5719  -0.0085   0.4493   1.8220
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -4.1030     1.7629  -2.327   0.0199 *
## SE            0.7583     0.3139   2.416   0.0157 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.092  on 18  degrees of freedom
## AIC: 20.092
##
## Number of Fisher Scoring iterations: 5

The logit of a number 0 < p < 1 is given by the formula $\text{logit}(p) = \log\frac{p}{1-p}$, and
represents the log-odds (of survival in this case) (Table 18.6).

confint(mylogit)

##                  2.5 %    97.5 %
## (Intercept) -8.6083535 -1.282692
## SE           0.2687893  1.576912

So, why do we need to exponentiate the coefficients? Because
$\text{logit}(p) = \log\frac{p}{1-p} \Rightarrow e^{\text{logit}(p)} = e^{\log\frac{p}{1-p}} \Rightarrow RHS = \frac{p}{1-p}$, which is the odds. Exponentiating
the SE coefficient therefore yields the odds ratio (OR) associated with a one-unit increase in SE.

exp(coef(mylogit))  # exponentiated logit model coefficients

## (Intercept)          SE
##  0.01652254  2.13474149

# note that 2.13474149 == exp(0.7583456)

coef(mylogit)  # raw logit model coefficients

## (Intercept)          SE
##  -4.1030298   0.7583456


Table 18.6 Point and interval estimates of the odds ratio of survival

               OR            2.5%            97.5%
(Intercept)    0.01652254    0.0001825743    0.277290
SE             2.13474149    1.3083794719    4.839986
exp(cbind(OR = coef(mylogit), confint(mylogit)))

##                     OR        2.5 %   97.5 %
## (Intercept) 0.01652254 0.0001825743 0.277290
## SE          2.13474149 1.3083794719 4.839986
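To see why the exponentiated SE coefficient is read as an odds ratio, one can compare the odds of survival at two experience levels that differ by one unit. This is a small sketch using the reported coefficients; the helper `odds()` is hypothetical.

```r
# Sketch: a one-unit increase in SE multiplies the odds by exp(a1),
# regardless of the starting value of SE.
a0 <- -4.1030298
a1 <-  0.7583456

odds <- function(SE) {
  p <- plogis(a0 + a1 * SE)   # estimated survival probability
  p / (1 - p)                 # odds of survival
}

ratio_2_to_3 <- odds(3) / odds(2)
ratio_7_to_8 <- odds(8) / odds(7)
c(ratio_2_to_3, ratio_7_to_8, exp(a1))
```

All three values coincide, which is why Table 18.6 reports a single OR for SE.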

We can compute the LRT and report its p-value (0.0006476922) by using the
with() function:

with(mylogit, df.null - df.residual)

with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual,
     lower.tail = FALSE))

## [1] 0.0006476922

The LRT p-value < 0.001 tells us that our model as a whole fits significantly better than
an empty (intercept-only) model. The residual deviance is -2 * log likelihood, and we can
report the model’s log likelihood by:

logLik(mylogit)

## 'log Lik.' -8.046117 (df=2)

The  LRT  compares  the  data  fit  of  two  models.  For  instance,  removing  predictor 
variables  from  a  model  may  reduce  the  model  quality  (i.e.,  a  model  will  have  a  lower 
log  likelihood).  To  statistically  assess  whether  the  observed  difference  in  model  fit  is 
significant,  the  LRT  compares  the  difference  of  the  log  likelihoods  of  the  two 
models.  When  this  difference  is  statistically  significant,  the  full  model  (the  one 
with  more  variables)  represents  a  better  fit  to  the  data,  compared  to  the  reduced 
model. LRT is computed using the log likelihoods ($ll$) of the two models:

$$LRT = -2\ln\left(\frac{L(m_1)}{L(m_2)}\right) = 2\left(ll(m_2) - ll(m_1)\right),$$

where:

• $m_1$ and $m_2$ are the reduced and the full models, respectively,

• $L(m_1)$ and $L(m_2)$ denote the likelihoods of the 2 models, and

• $ll(m_1)$ and $ll(m_2)$ represent the log likelihood (natural log of the model likelihood
function).

As $n \to \infty$, the distribution of the LRT is asymptotically chi-squared with degrees
of freedom equal to the number of parameters that are reduced (i.e., the number of
variables removed from the model). In our case, $LRT \sim \chi^2_{df=1}$: the full model has an
intercept and one predictor (SE), the null model fitted by glm() retains only the intercept,
and therefore df.null - df.residual = 19 - 18 = 1, which is the value used in the pchisq() call.
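The LRT can also be reproduced by hand from the quantities printed in the model summary. The sketch below uses the rounded deviances and the reported log likelihood, so the last digits may differ slightly from the value obtained with the fitted object; all variable names are illustrative.

```r
# Sketch: likelihood ratio test from the reported deviances.
null_dev  <- 27.726          # null deviance, 19 degrees of freedom
resid_dev <- 16.092          # residual deviance, 18 degrees of freedom
lrt_stat  <- null_dev - resid_dev
lrt_df    <- 19 - 18
lrt_p     <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

# Same statistic from log likelihoods: 2 * (ll(full) - ll(null)).
# The null (intercept-only) model assigns probability 0.5 to each of the
# 20 patients, since 10 of them survived.
ll_full     <- -8.046117
ll_null     <- 20 * log(0.5)
lrt_stat_ll <- 2 * (ll_full - ll_null)

c(lrt_stat, lrt_stat_ll, lrt_p)
```

Both routes give a statistic of about 11.63 on one degree of freedom and a p-value below 0.001.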
18.8.3 False Discovery Rate (FDR)

The FDR provides one measure of test or classifier performance:

$$\underbrace{FDR}_{\text{False Discovery Rate}} = \underbrace{E}_{\text{expectation}} \underbrace{\left(\frac{\#\text{False Positives}}{\text{total number of selected features}}\right)}_{\text{False Discovery Proportion}}.$$


The Benjamini-Hochberg (BH) FDR procedure involves ordering the p-values,
specifying a target FDR, calculating and applying the threshold. Below we show
how this is accomplished in R.

# enter the p-values (in any order; they are sorted below)
pvals <- c(0.9, 0.35, 0.01, 0.013, 0.014, 0.19, 0.35, 0.5, 0.63, 0.67, 0.75,
           0.81, 0.01, 0.051)
length(pvals)

## [1] 14

# enter the target FDR
alpha.star <- 0.05

# order the p-values small to large
pvals <- sort(pvals); pvals

## [1] 0.010 0.010 0.013 0.014 0.051 0.190 0.350 0.350 0.500 0.630 0.670
## [12] 0.750 0.810 0.900

# calculate the threshold for each p-value
threshold <- alpha.star*(1:length(pvals))/length(pvals)

# compare the p-value against its threshold and display results
cbind(pvals, threshold, pvals<=threshold)

##       pvals   threshold
## [1,]  0.010 0.003571429 0
## [2,]  0.010 0.007142857 0
## [3,]  0.013 0.010714286 0
## ...
## [12,] 0.750 0.042857143 0
## [13,] 0.810 0.046428571 0
## [14,] 0.900 0.050000000 0

Starting with the smallest p-value and moving up, we find the largest $k$ for which
the p-value $p_{(k)}$ does not exceed its threshold, $\alpha^* k/m$ (with $m = 14$ tests); here $k = 4$.

Next, the algorithm rejects the null hypotheses for the tests that correspond to the
p-values $p_{(1)}, p_{(2)}, p_{(3)}, p_{(4)}$.

Note that since we controlled FDR at $\alpha^* = 0.05$, we are guaranteed that on
average only 5% of the tests that we rejected are spurious. Since 5% of 4 rejections is 0.2,
which is quite small and less than 1, we expect that none of our rejections are
spurious.

The Bonferroni corrected $\alpha$ for these data is $\frac{0.05}{14} = 0.0036$. If we had used this
family-wise error rate in our individual hypothesis tests, then we would have
concluded that none of our 14 results were significant!
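The counts quoted above can be checked with a few lines of base R. This is a sketch that reuses the 14 p-values from the example; the names `k.max`, `n.bh`, and `n.bonf` are illustrative.

```r
# Sketch: number of rejections under BH (FDR = 0.05) versus Bonferroni.
pvals <- sort(c(0.9, 0.35, 0.01, 0.013, 0.014, 0.19, 0.35, 0.5, 0.63, 0.67,
                0.75, 0.81, 0.01, 0.051))
m <- length(pvals)
alpha.star <- 0.05

# largest index k whose ordered p-value is at or below alpha.star * k / m
k.max <- max(c(0, which(pvals <= alpha.star * (1:m) / m)))

# the same count via BH-adjusted p-values
n.bh <- sum(p.adjust(pvals, method = "BH") <= alpha.star)

# Bonferroni compares every p-value with alpha.star / m
n.bonf <- sum(pvals <= alpha.star / m)

c(k.max = k.max, BH = n.bh, Bonferroni = n.bonf)
```

BH rejects the four smallest p-values even though the first three individually exceed their own thresholds, because the procedure rejects everything up to the largest qualifying index.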


Graphical  Interpretation  of  the  Benjamini-Hochberg  (BH)  Method 

There’s  a  graphical  interpretation  of  the  BH  calculations. 

• Sort the p-values from smallest to largest.

• Plot the ordered p-values $p_{(k)}$ on the y-axis versus their scaled indexes $k/m$ on the x-axis.

• Superimpose on this plot a line that passes through the origin and has slope $\alpha^*$.

Starting from the largest p-value that falls on or below this line, that p-value and all smaller
p-values correspond to significant results (Fig. 18.18).


Fig. 18.18 Graphical representation of the naive, conservative Bonferroni, and FDR critical
p-values
# generate the values to be plotted on x-axis
x.values <- (1:length(pvals))/length(pvals)

# widen right margin to make room for labels
par(mar=c(4.1, 4.1, 1.1, 4.1))

# plot points
plot(x.values, pvals, xlab=expression(k/m), ylab="p-value")
#, ylim=c(0.0, 0.4))

# add FDR line
abline(0, .05, col=2, lwd=2)

# add naive threshold line
abline(h=.05, col=4, lty=2)

# add Bonferroni-corrected threshold line
abline(h=.05/length(pvals), col=4, lty=2)

# label lines
mtext(c('naive', 'Bonferroni'), side=4, at=c(.05, .05/length(pvals)),
      las=1, line=0.2)

# select observations that are less than threshold
for.test <- cbind(1:length(pvals), pvals)
pass.test <- for.test[pvals <= 0.05*x.values, ]
pass.test
##       pvals
## 4.000 0.014

# use the largest k to color points that meet Benjamini-Hochberg FDR test
last <- ifelse(is.vector(pass.test), pass.test[1],
               pass.test[nrow(pass.test), 1])
points(x.values[1:last], pvals[1:last], pch=19, cex=1.5)


FDR Adjusting the p-Values

R can automatically perform the Benjamini-Hochberg procedure. The adjusted
p-values are obtained by

pvals.adjusted <- p.adjust(pvals, "BH")

The adjusted p-values indicate which null hypotheses we need to
reject to preserve the initial $\alpha^*$ false discovery rate: we reject those whose adjusted
p-value does not exceed $\alpha^*$. We can also compute the adjusted p-values as follows:
# calculate the term that appears in the innermost minimum function
test.p <- length(pvals)/(1:length(pvals))*pvals
test.p

## [1] 0.14000000 0.07000000 0.06066667 0.04900000 0.14280000 0.44333333
## [7] 0.70000000 0.61250000 0.77777778 0.88200000 0.85272727 0.87500000
## [13] 0.87230769 0.90000000

# use a loop to run through each p-value and carry out the adjustment
adj.p <- numeric(14)
for(i in 1:14) {
  # running minimum over the current and all larger p-values, capped at 1
  adj.p[i] <- min(1, min(test.p[i:length(test.p)]))
}
adj.p

## [1] 0.0490000 0.0490000 0.0490000 0.0490000 0.1428000 0.4433333 0.6125000
## [8] 0.6125000 0.7777778 0.8527273 0.8527273 0.8723077 0.8723077 0.9000000

Note that the manually computed (adj.p) and the automatically computed
(pvals.adjusted) adjusted p-values are the same.
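The loop can also be written as a single vectorized expression. The sketch below assumes the p-values are already sorted in increasing order, as in the example; `adj.vec` is an illustrative name.

```r
# Sketch: vectorized BH adjustment using a reversed cumulative minimum.
pvals <- sort(c(0.9, 0.35, 0.01, 0.013, 0.014, 0.19, 0.35, 0.5, 0.63, 0.67,
                0.75, 0.81, 0.01, 0.051))
m <- length(pvals)

# m/k * p_(k), then the minimum over positions k, k+1, ..., m, capped at 1
adj.vec <- pmin(1, rev(cummin(rev(m / (1:m) * pvals))))

# agreement with the built-in adjustment
agree <- isTRUE(all.equal(adj.vec, p.adjust(pvals, method = "BH")))
```

Reversing the vector lets `cummin()` accumulate the minimum from the largest p-value downwards, which is the same "minimum over the tail" that the loop computes.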


18.8.4  Running  the  Knockoff  Filter 

We now run the knockoff filter along with the Benjamini-Hochberg (BH) procedure
for controlling the false discovery rate of feature selection. More details about the
knockoff filtering methods are available online.

Before running either selection procedure, remove rows with missing values,
reduce the design matrix by removing predictor columns that do not appear
frequently (e.g., keep only predictors that appear at least three times in the sample),
and remove any columns that are duplicates.


library(knockoff)

# Direct call to knockoff filtering looks like this:
fdr <- 0.1
result = knockoff.filter(X, Y, fdr = fdr, knockoffs = 'equicorrelated')
names(result$selected)

## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"
## [3] "Sex"                           "Weight"
## [5] "Age"                           "chr12_rs34637584_GT"
## [7] "chr17_rs11012_GT"              "chr17_rs199533_GT"

knockoff_selected <- names(result$selected)

# Run BH (Benjamini-Hochberg)
k = ncol(X)
lm.fit = lm(Y ~ X - 1)  # no intercept

# Alternatively: dat = as.data.frame(cbind(Y, X))
# lm.fit = lm(Y ~ . -1, data=dat)  # no intercept
p.values = coef(summary(lm.fit))[, 4]
cutoff = max(c(0, which(sort(p.values) <= fdr * (1:k) / k)))
BH_selected = names(which(p.values <= fdr * cutoff / k))

knockoff_selected; BH_selected

## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"
## [3] "Sex"                           "Weight"
## [5] "Age"                           "chr12_rs34637584_GT"
## [7] "chr17_rs11012_GT"              "chr17_rs199533_GT"

## [1] "XL_putamen_ComputeArea" "XL_putamen_Volume"
## [3] "XSex"                   "XWeight"
## [5] "XAge"                   "Xchr17_rs11868035_GT"
## [7] "Xchr17_rs11012_GT"      "Xchr17_rs12185268_GT"

# Housekeeping: remove the "X" prefixes in the BH_selected list of features
for(i in 1:length(BH_selected)){
  BH_selected[i] <- substring(BH_selected[i], 2)
}

intersect(BH_selected, knockoff_selected)

## [1] "Sex"              "Weight"           "Age"
## [4] "chr17_rs11012_GT"

We  see  that  there  are  some  features  that  are  selected  by  both  methods  suggesting 
they  may  be  indeed  salient. 

Try to apply some of these techniques to other data from the list of our
Case-Studies.


18.9  Assignment:  18.  Regularized  Linear  Modeling 
and  Knockoff  Filtering 

Use  the  ALS  (Case  Study  15)  data  to: 

•  Detect and impute missing values, if any.

•  Use  the  ALSFRS_slope  as  a  clinically  relevant  outcome  variable. 

•  Randomly  split  data  into  training  (70%)  and  testing  (30%)  datasets. 

•  Use the LASSO to fit a model with cross validation (with optimized regularization
parameter) and visualize the result.

•  Similarly,  train  a  ridge  regression  model. 

•  Train  OLS  model  and  improve  it  with  stepwise  variable  selection. 

•  Report  the  coefficient  estimates  for  OLS,  Stepwise  OLS  with  AIC,  Ridge  and 
LASSO. 

•  Calculate the predicted values for all 4 models and report the models’ performance
metrics (RMSE and $R^2$).

•  Apply  knockoff  filtering  for  variable  selection,  controlling  the  false 
discovery  rate. 

•  Compare  the  variables  selected  by  Stepwise  OLS,  LASSO  and  knockoff. 
Chapter 19

Big  Longitudinal  Data  Analysis 



The  time-varying  (longitudinal)  characteristics  of  large  information  flows  represent  a 
special  case  of  the  complexity  and  the  dynamic  multi-scale  nature  of  big  biomedical 
data  that  we  discussed  in  the  DSPA  Motivation  section.  Previously,  in  Chap.  4,  we 
saw  space-time  (4D)  functional  magnetic  resonance  imaging  (fMRI)  data,  and  in 
Chap.  16  we  discussed  streaming  data,  which  also  has  a  natural  temporal  dimension. 
Now  we  will  go  deeper  into  managing,  modeling  and  analyzing  big  longitudinal  data. 

In  this  Chapter,  we  will  expand  our  predictive  data  analytic  strategies  specifically 
for  analyzing  big  longitudinal  data.  We  will  interrogate  datasets  that  track  the  same 
type  of  information,  for  the  same  subjects,  units  or  locations,  over  a  period  of  time. 
Specifically,  we  will  present  time  series  analysis,  forecasting  using  autoregressive 
integrated  moving  average  (ARIMA)  models,  structural  equation  models  (SEM), 
and  longitudinal  data  analysis  via  linear  mixed  models. 


19.1  Time  Series  Analysis 

Time series analysis relies on models like ARIMA (autoregressive integrated moving
average) that utilize past longitudinal information to predict near future outcomes.
Time series data tend to track univariate, sometimes multivariate, processes over a
continuous time interval. The stock market, e.g., daily closing value of the Dow Jones
Industrial Average index, electroencephalography (EEG) data, and functional magnetic
resonance imaging provide examples of such longitudinal datasets (time series).
The basic concepts in time series analysis include:

•  The  characteristics  of  (second-order)  stationary  time  series  (e.g.,  first  two 
moments  are  stable  over  time)  do  not  depend  on  the  time  at  which  the  series 
process  is  observed. 

•  Differencing - a transformation applied to time-series data to make it stationary.
Differences between consecutive time-observations may be computed by
$y'_t = y_t - y_{t-1}$. Differencing removes the level changes in the time series, eliminates
trend, reduces seasonality, and stabilizes the mean of the time series. Differencing
the time series repeatedly may yield a stationary time series. For example, a
second order differencing:

$$y''_t = y'_t - y'_{t-1} = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}.$$

•  Seasonal differencing is computed as a difference between one observation and
its corresponding observation in the previous epoch, or season (e.g., annually,
there are m = 4 seasons), like in this example:

$$y'_t = y_t - y_{t-m}, \text{ where } m = \text{number of seasons}.$$

•  The differenced data may then be used to estimate an ARMA model.
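The differencing formulas above map directly onto base R's `diff()`. The sketch below uses a short made-up series (not the Beijing data) so that the results can be checked by hand.

```r
# Sketch: first-order, second-order, and seasonal differencing on a toy series.
y <- c(5, 7, 10, 14, 19, 25, 32, 40)
n <- length(y)

d1 <- diff(y)                    # y_t - y_{t-1}
d2 <- diff(y, differences = 2)   # difference of the differences

# the closed form y_t - 2*y_{t-1} + y_{t-2} gives the same values
d2.manual <- y[3:n] - 2 * y[2:(n - 1)] + y[1:(n - 2)]

# seasonal differencing with m = 4 seasons: y_t - y_{t-m}
m <- 4
ds <- diff(y, lag = m)
ds.manual <- y[(m + 1):n] - y[1:(n - m)]
```

The toy series has a quadratic trend, so the first differences still increase while the second differences are constant, which illustrates why repeated differencing can be needed to stabilize the mean.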

We  will  use  the  Beijing  air  quality  PM2.5  dataset  as  an  example  to  demonstrate 
the  analysis  process.  This  dataset  measures  air  pollutants  -  PM2.5  particles  in 
micrograms  per  cubic  meter  over  a  period  of  8  years  (2008-2016).  It  measures  the 
hourly  average  of  the  number  of  particles  that  are  of  size  2.5  microns  (PM2.5)  once 
per  hour  in  Beijing,  China. 

Let’s  first  import  the  dataset  into  R. 


beijing.pm25 <- read.csv("https://umich.instructure.com/files/1823138/download?download_frd=1")
summary(beijing.pm25)


##      Index            Site          Parameter              Date..LST.
##  Min.   :    1   Beijing:69335   PM2.5:69335   3/13/2011 3:00:    2
##  1st Qu.:17335                                 3/13/2016 3:00:    2
##  Median :34668                                 3/14/2010 3:00:    2
##  Mean   :34668                                 3/8/2009 3:00 :    2
##  3rd Qu.:52002                                 3/8/2015 3:00 :    2
##  Max.   :69335                                 3/9/2014 3:00 :    2
##                                                (Other)       :69323
##       Year          Month             Day             Hour
##  Min.   :2008   Min.   : 1.000   Min.   : 1.00   Min.   : 0.0
##  1st Qu.:2010   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 5.5
##  Median :2012   Median : 6.000   Median :16.00   Median :11.0
##  Mean   :2012   Mean   : 6.407   Mean   :15.73   Mean   :11.5
##  3rd Qu.:2014   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:17.5
##  Max.   :2016   Max.   :12.000   Max.   :31.00   Max.   :23.0
##
##      Value          Duration        QC.Name
##  Min.   :-999.00   1 Hr:69335   Missing: 4408
##  1st Qu.:  22.00                Valid  :64927
##  Median :  63.00
##  Mean   :  24.99
##  3rd Qu.: 125.00
The Value column records PM2.5 AQI (Air Quality Index) for 8 years. We
observe that there are some missing data in the Value column, coded as -999. By looking at the
QC.Name column, we only have about 6.4% (4408 of 69,335 observations) missing values.
One way of solving data-missingness problems, where incomplete observations are
recorded, is to replace the absent elements by the corresponding variable mean.

beijing.pm25[beijing.pm25$Value==-999, 9] <- NA
beijing.pm25[is.na(beijing.pm25$Value), 9] <- floor(mean(beijing.pm25$Value, na.rm = T))

Here we first reassign the missing values into NA labels. Then we replace all NA
labels with the mean computed using all non-missing observations. Note that the
floor() function casts the arithmetic averages as integer numbers, which is needed
as AQI values are expected to be whole numbers.
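The two imputation steps can be traced on a tiny made-up vector, which makes the effect of `floor()` visible. The vector `value` below is hypothetical and only mimics the -999 missing-value code used in the dataset.

```r
# Sketch: recode the -999 sentinel as NA, then impute with the floored mean.
value <- c(10, -999, 21, 30, -999)

value[value == -999] <- NA               # step 1: mark missing entries
fill <- floor(mean(value, na.rm = TRUE)) # mean of 10, 21, 30 is 20.33, floored to 20
value[is.na(value)] <- fill              # step 2: replace NA by the imputed value

value
```

Recoding first matters: if the -999 codes were left in place they would be averaged into the mean, as happens in the summary output where the Value mean (24.99) is far below the median (63).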

Now, let’s observe the trend of hourly average PM2.5 across 1 day. You can see a
significant pattern: the PM2.5 level peaks in the afternoons and is the lowest in the
early mornings. It exhibits approximate periodic boundary conditions (these patterns
oscillate daily) (Fig. 19.1).

Fig. 19.1 Time course of the mean, top-20%, and bottom-20% air quality in Beijing (PM2.5)
require(ggplot2)

id = 1:nrow(beijing.pm25)
mat = matrix(0, nrow=24, ncol=3)

# mean and 20%/80% quantiles of Value over the rows indexed by x
stat = function(x){
  c(mean(beijing.pm25[x, "Value"]), quantile(beijing.pm25[x, "Value"], c(0.2, 0.8)))
}

# group the records by their position within each 24-hour cycle
for (i in 1:24){
  iid = which(id%%24==i-1)
  mat[i,] = stat(iid)
}

mat <- as.data.frame(mat)
colnames(mat) <- c("mean", "20%", "80%")
mat$time = c(15:23, 0:14)
require(reshape2)

## Loading required package: reshape2

dt <- melt(mat, id="time")
colnames(dt)

## [1] "time"     "variable" "value"

ggplot(data = dt, mapping = aes(x=time, y=value, color=variable)) + geom_line() +
  scale_x_continuous(breaks = 0:23) + ggtitle("Beijing hour average PM2.5 from 2008-2016")

Are  there  any  daily  or  monthly  trends?  We  can  start  the  data  interrogation  by 
building  an  ARIMA  model  and  examining  detailed  patterns  in  the  data. 


19.1.1  Step  1:  Plot  Time  Series 

To  begin  with,  we  can  visualize  the  overall  trend  by  plotting  PM2.5  values  against 
time.  This  can  be  achieved  using  the  plyr  package. 

library(plyr)

ts <- ts(beijing.pm25$Value, start=1, end=69335, frequency=1)
ts.plot(ts)

The dataset is recorded hourly, and the 8-year time interval includes 69,335 h
of records. Therefore, we start at the first hour and end with the 69,335th h. Each hour has
a univariate PM2.5 AQI value measurement, so frequency=1.

From this time series plot, Fig. 19.2, we observe that the data has some peaks but
most of the AQIs stay under 300 (values above 300 are considered hazardous).

The original plot seems to have no trend at all. Remember we have our measurements
in hours. Will there be any difference if we use monthly averages instead of
hourly reported values? In this case, we can use the Simple Moving Average (SMA)
technique to smooth the original graph.

Fig. 19.2 Raw time-series plot of the Beijing air quality measures (2008-2016)

Fig. 19.3 Simple moving monthly average PM2.5 air quality index values


To  accomplish  this,  we  need  to  install  the  TTR  package  and  utilize  the  SMA() 
method  (Fig.  19.3). 


#install.packages("TTR")
library(TTR)

bj.month <- SMA(ts, n=720)
plot.ts(bj.month, main="Monthly PM2.5 Level SMA", ylab="PM2.5 AQI")

Here we chose n to be 24 * 30 = 720, and we can see some pattern. It seems that
for the first 4 years (or approximately 35,040 h), the AQI fluctuates less than the last
5 years. Let’s see what happens if we use the exponentially-weighted mean, instead of
the arithmetic mean.

Fig. 19.4 Exponentially-weighted monthly mean of PM2.5 air quality

bj.month <- EMA(ts, n=1, ratio = 2/(720+1))
plot.ts(bj.month, main="Monthly PM2.5 Level EMA", ylab="PM2.5 AQI")

The  pattern  seems  less  obvious  in  this  graph,  Fig.  19.4.  Here  we  used  exponential 
smoothing  ratio  of  2/(n  +  1). 
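To make the two smoothers concrete, here is a base-R sketch of a trailing simple moving average and of an exponentially-weighted mean with ratio 2/(n + 1). The functions `sma()` and `ema()` are illustrative stand-ins written for this example, not the TTR implementations; in particular, seeding the recursion with the first observation is an assumption.

```r
# Sketch: simple and exponentially-weighted moving averages in base R.
sma <- function(x, n) {
  # trailing window: average of the current and the previous n - 1 values
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

ema <- function(x, ratio) {
  out <- numeric(length(x))
  out[1] <- x[1]                       # assumed seed: first observation
  for (i in 2:length(x)) {
    out[i] <- ratio * x[i] + (1 - ratio) * out[i - 1]
  }
  out
}

x <- c(2, 4, 6, 8, 10, 12)
sma3 <- sma(x, n = 3)                  # NA NA 4 6 8 10
ema3 <- ema(x, ratio = 2 / (3 + 1))    # 2 3 4.5 6.25 8.125 10.0625
```

The SMA weights the last n observations equally, whereas the EMA lets the weight of older observations decay geometrically, which is why the two smoothed curves of the AQI series look different.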


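19.1.2 Step 2: Find Proper Parameter Values
for ARIMA Model

ARIMA models have 2 components: an autoregressive (AR) part and a moving average
(MA) part. An ARIMA(p, d, q) model is a model with p terms in AR, q terms in MA,
and d representing the order of differencing. Differencing is used to make the
original dataset approximately stationary. ARIMA(p, d, q) has the following analytical
form:

$$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\epsilon_t,$$

where $L$ is the lag operator ($L X_t = X_{t-1}$), $\phi_i$ and $\theta_i$ are the AR and MA coefficients,
and $\epsilon_t$ is the error term.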
19.1.3 Check the Differencing Parameter

First, let’s try to determine the parameter d. To make the data stationary on the mean
(remove any trend), we can use first differencing or second order differencing.
Mathematically, first differencing is taking the difference between two adjacent
data points:

$$y'_t = y_t - y_{t-1}.$$

While second order differencing is differencing the data twice:

$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}.$$

Let’s see which differencing method is proper for the Beijing PM2.5 dataset.
Function diff() in R base can be used to calculate differencing. We can plot the
differences by plot.ts() (Fig. 19.5).

par(mfrow=c(2, 1))
bj.diff2 <- diff(ts, differences=2)
plot.ts(bj.diff2, main="2nd differencing")
bj.diff <- diff(ts, differences=1)
plot.ts(bj.diff, main="1st differencing")

Fig. 19.5 First- and second-order differencing of the AQI data

Neither of them appears quite stationary. In this case, we can consider using some
smoothing techniques on the data like we just did above (e.g., bj.month <- SMA(ts,
n=720)). Let’s see if smoothing by exponentially-weighted mean (EMA) can help
make the data approximately stationary (Fig. 19.6).

par(mfrow=c(2, 1))
bj.diff2 <- diff(bj.month, differences=2)
plot.ts(bj.diff2, main="2nd differencing")
bj.diff <- diff(bj.month, differences=1)
plot.ts(bj.diff, main="1st differencing")

Fig. 19.6 Monthly-smoothed first- and second-order differencing of the AQI data

Both  of  these  EMA-filtered  graphs  have  tempered  variance  and  appear  pretty 
stationary  with  respect  to  the  first  two  moments,  mean  and  variance. 


19.1.4  Identifying  the  AR  and  MA  Parameters 

To  decide  the  auto-regressive  (AR)  and  moving  average  (MA)  parameters  in  the 
model  we  need  to  create  autocorrelation  factor  (ACF)  and  partial  autocorrelation 
factor  (PACF)  plots.  PACF  may  suggest  a  value  for  the  AR-term  parameter  q,  and 
ACF  may  help  us  determine  the  M A- term  parameter  p.  We  plot  the  ACF  and  PACF 
using  the  approximately  stationary  time  series,  bj  .  dif  f  object  (Fig.  19.7). 

par(mfrow=c(1, 2))
acf(ts(bj.diff), lag.max = 20, main="ACF")
pacf(ts(bj.diff), lag.max = 20, main="PACF")

• A pure AR model (q = 0) will have a cut off at lag p in the PACF.

• A pure MA model (p = 0) will have a cut off at lag q in the ACF.

• A mixed ARMA(p, q) model will (eventually) have a decay in both.
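
These cut-off rules can be checked on simulated data. The sketch below is only an illustration under stated assumptions: the AR(2) and MA(1) coefficients, the sample size, and the seed are arbitrary choices, and the significance bound is the usual large-sample approximation.

```r
set.seed(2024)
n <- 2000
ar2 <- arima.sim(model = list(ar = c(0.6, 0.25)), n = n)  # pure AR(2), q = 0
ma1 <- arima.sim(model = list(ma = 0.8), n = n)           # pure MA(1), p = 0

bound <- qnorm(0.975) / sqrt(n)   # approximate 95% significance bound

# Sample PACF of the AR(2) series (lags 1..10) and sample ACF of the MA(1) series
pacf_ar2 <- pacf(ar2, lag.max = 10, plot = FALSE)$acf[, 1, 1]
acf_ma1  <- acf(ma1, lag.max = 10, plot = FALSE)$acf[-1, 1, 1]  # drop lag 0

round(bound, 3)
round(pacf_ar2, 3)   # large at lags 1-2, close to zero afterwards
round(acf_ma1, 3)    # large at lag 1, close to zero afterwards
```

In a finite sample a few values beyond the cut-off may still cross the bound by chance, so the plots are read as suggestions for p and q rather than as exact estimates.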


Fig. 19.7 Autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of bj.diff (horizontal axes: Lag)


In the ACF plot, all spikes fall outside the (normal) insignificance band, while in the PACF plot two of them are significant. In this case, the best ARIMA model is likely to have both AR and MA parts.

We can examine seasonal effects in the data using stats::stl(), a flexible function for decomposing and forecasting the series, which uses averaging to calculate the seasonal component of the series and then subtracts the seasonality. Decomposing the series and removing the seasonality can be done by subtracting the seasonal component from the original series using forecast::seasadj(). The frequency parameter in the ts() object specifies the periodicity of the data, i.e., the number of observations per period, e.g., 30 for monthly smoothed daily data (Fig. 19.8).

count_ma = ts(bj.month, frequency=30)
decomp = stl(count_ma, s.window="periodic")
deseasonal_count <- forecast::seasadj(decomp)
plot(decomp)
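
The same decomposition logic can be reproduced with base R alone. This is a minimal sketch that assumes the built-in co2 dataset (monthly data, frequency 12) as a stand-in for the smoothed PM2.5 series; subtracting the seasonal component by hand mirrors what forecast::seasadj() is described as doing above.

```r
# co2 is a built-in monthly series; it is used here only as an illustrative stand-in
frequency(co2)                            # number of observations per period

fit_stl <- stl(co2, s.window = "periodic")
parts <- fit_stl$time.series              # columns: seasonal, trend, remainder

# Seasonally adjusted series: original minus the seasonal component
deseasonalized <- co2 - parts[, "seasonal"]

# The three components add back up to the original series
recon <- parts[, "seasonal"] + parts[, "trend"] + parts[, "remainder"]
max(abs(recon - co2))
```

With s.window = "periodic" the seasonal component is identical in every period, so it captures a fixed seasonal pattern rather than one that evolves over time.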

The augmented Dickey-Fuller (ADF) test, tseries::adf.test(), can be used to examine the stationarity of a time series. The null hypothesis is that the series is non-stationary. The ADF test quantifies whether the change in the series can be explained by a lagged value and a linear trend. Non-stationary series can be corrected by differencing to remove trends or cycles.

Fig. 19.8 Trend and seasonal decomposition of the time-series (horizontal axis: Time)


tseries::adf.test(count_ma, alternative = "stationary")

##  Augmented Dickey-Fuller Test
##
## data:  count_ma
## Dickey-Fuller = -8.0313, Lag order = 41, p-value = 0.01
## alternative hypothesis: stationary

tseries::adf.test(bj.diff, alternative = "stationary")

##  Augmented Dickey-Fuller Test
##
## data:  bj.diff
## Dickey-Fuller = -29.188, Lag order = 41, p-value = 0.01
## alternative hypothesis: stationary

In both cases the reported p-value is 0.01, so we can reject the null hypothesis of non-stationarity. In particular, there is no statistically significant evidence of non-stationarity in the bj.diff time series.
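
The regression behind the test can be sketched in base R. The helper below, df_stat(), is a hypothetical name and implements only the simplest (non-augmented) Dickey-Fuller regression of the change in the series on its lagged value and a linear trend; it is not the tseries::adf.test() implementation, and the simulated series and seed are assumptions.

```r
set.seed(42)
n <- 500
rw  <- cumsum(rnorm(n))                               # random walk: has a unit root
ar1 <- as.numeric(arima.sim(list(ar = 0.5), n = n))   # stationary AR(1)

# Hypothetical helper: t-statistic of the lagged level in
#   diff(y)[t] = a + rho * y[t-1] + b * t + error
df_stat <- function(y) {
  dy    <- diff(y)
  ylag  <- y[-length(y)]
  trend <- seq_along(dy)
  fit   <- lm(dy ~ ylag + trend)
  summary(fit)$coefficients["ylag", "t value"]
}

stats_df <- c(random_walk = df_stat(rw), ar1 = df_stat(ar1))
round(stats_df, 2)
```

A strongly negative statistic (well below roughly -3.4, the approximate 5% critical value for the version with a trend) is evidence against a unit root. The statistic does not follow a Student t distribution, which is why dedicated tables or tseries::adf.test() are used for p-values.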


19.1.5  Step  3:  Build  an  ARIMA  Model 

As we have some evidence suggesting d = 1, the auto.arima() function in the forecast package can help us find the optimal estimates for the remaining pair of ARIMA model parameters, p and q.


# install.packages("forecast")
library(forecast)
fit<-auto.arima(bj.month, approx=F, trace = F)
fit

## Series: bj.month
## ARIMA(1,1,4)
##
## Coefficients:
##          ar1     ma1     ma2     ma3     ma4
##       0.9426  0.0813  0.0323  0.0156  0.0074
## s.e.  0.0016  0.0041  0.0041  0.0041  0.0041
##
## sigma^2 estimated as 0.004604:  log likelihood=88161.91
## AIC=-176311.8   AICc=-176311.8   BIC=-176257

Acf(residuals(fit))

Finally, the optimal model determined by the step-wise selection is ARIMA(1,1,4). The residual plot is shown in Fig. 19.9.

Fig. 19.9 ACF of the time-series residuals (panel title: Series residuals(fit))

We can also use external information to fit ARIMA models. For example, if we want to add the month information, in case we suspect a seasonal change in PM2.5 AQI, we can use the following script.

fit1<-auto.arima(bj.month, xreg=beijing.pm25$Month, approx=F, trace = F)
fit1

## Series: bj.month
## Regression with ARIMA(1,1,4) errors
##
## Coefficients:
##          ar1     ma1     ma2     ma3     ma4  beijing.pm25$Month
##       0.9427  0.0813  0.0322  0.0156  0.0075             -0.0021
## s.e.  0.0016  0.0041  0.0041  0.0041  0.0041              0.0015
##
## sigma^2 estimated as 0.004604:  log likelihood=88162.9
## AIC=-176311.8   AICc=-176311.8   BIC=-176247.8

fit3<-arima(bj.month, order = c(2, 1, 0))
fit3

## Call:
## arima(x = bj.month, order = c(2, 1, 0))
##
## Coefficients:
##          ar1      ar2
##       1.0260  -0.0747
## s.e.  0.0038   0.0038
##
## sigma^2 estimated as 0.004606:  log likelihood = 88138.32,  aic = -176270.6


We want the model AIC and BIC to be as small as possible. In terms of AIC and BIC, the model with the Month predictor is not drastically different from the previous model without it. Also, the coefficient of Month is very small and not significant (according to the t-test), and thus the predictor can be removed.

We  can  examine  further  the  ACF  and  the  PACF  plots  and  the  residuals  to 
determine  the  model  quality.  When  the  model  order  parameters  and  structure  are 
correctly  specified,  we  expect  no  significant  autocorrelations  present  in  the  model 
residual  plots. 

tsdisplay(residuals(fit), lag.max=45, main='(1,1,4) Model Residuals')
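
A numerical complement to the visual inspection is the Ljung-Box test on the residuals, available in base R as Box.test(). The sketch below uses a simulated ARIMA(1,1,1) series (coefficients and seed are assumptions) and compares a correctly specified model with a deliberately under-specified one.

```r
set.seed(7)
# Simulated ARIMA(1,1,1) series; not the PM2.5 data
y <- arima.sim(list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = 1000)

good <- arima(y, order = c(1, 1, 1))   # matches the data-generating structure
poor <- arima(y, order = c(0, 1, 0))   # ignores the AR and MA terms

# Ljung-Box test of residual autocorrelation up to lag 20;
# fitdf is the number of estimated ARMA coefficients
lb_good <- Box.test(residuals(good), lag = 20, type = "Ljung-Box", fitdf = 2)
lb_poor <- Box.test(residuals(poor), lag = 20, type = "Ljung-Box")

c(good = unname(lb_good$statistic), poor = unname(lb_poor$statistic))
c(good = lb_good$p.value, poor = lb_poor$p.value)
```

A small p-value indicates remaining autocorrelation in the residuals, i.e., structure that the model order has not captured.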


There is a clear pattern present in the ACF/PACF plots (Fig. 19.10), suggesting that the model residuals repeat with an approximate period of 12 or 24 lags (hourly observations). We may try a modified model with different parameters, e.g., p = 24 or q = 24. We can also define a new displayForecastErrors() function to show a histogram of the forecast errors (Figs. 19.11 and 19.12).

Fig. 19.10 ARIMA(1,1,4) model plot, ACF and PACF plots of the residuals for bj.month

Fig. 19.11 An improved ARIMA(1,1,24) model plot, ACF and PACF plots of the residuals for bj.month

Fig. 19.12 Diagnostic plot of the residuals of the ARIMA(1,1,24) time-series model for bj.month


fit24 <- arima(deseasonal_count, order=c(1, 1, 24)); fit24

## Call:
## arima(x = deseasonal_count, order = c(1, 1, 24))
##
## Coefficients:
##          ar1     ma1     ma2     ma3      ma4      ma5      ma6      ma7
##       0.9496  0.0711  0.0214  0.0054  -0.0025  -0.0070  -0.0161  -0.0149
## s.e.  0.0032  0.0049  0.0049  0.0048   0.0047   0.0046   0.0045   0.0044
##           ma8      ma9     ma10     ma11     ma12     ma13     ma14
##       -0.0162  -0.0118  -0.0100  -0.0136  -0.0045  -0.0055  -0.0075
## s.e.   0.0044   0.0043   0.0042   0.0042   0.0042   0.0041   0.0041
##          ma15     ma16     ma17    ma18    ma19    ma20    ma21    ma22
##       -0.0060  -0.0005  -0.0019  0.0066  0.0088  0.0156  0.0247  0.0117
## s.e.   0.0041   0.0041   0.0041  0.0041  0.0041  0.0040  0.0040  0.0040
##         ma23    ma24
##       0.0319  0.0156
## s.e.  0.0040  0.0039
##
## sigma^2 estimated as 0.004585:  log likelihood = 88295.88,  aic = -176539.8

tsdisplay(residuals(fit24), lag.max=36, main='Seasonal Model Residuals')


displayForecastErrors <- function(forecastErrors)
{
  # Generate a histogram of the Forecast Errors
  binsize <- IQR(forecastErrors)/4
  sd <- sd(forecastErrors)
  min <- min(forecastErrors) - sd
  max <- max(forecastErrors) + sd

  # Generate 5K normal(0,sd) RVs
  norm <- rnorm(5000, mean=0, sd=sd)
  min2 <- min(norm)
  max2 <- max(norm)
  if (min2 < min) { min <- min2 }
  if (max2 > max) { max <- max2 }

  # Plot red histogram of the forecast errors
  # (one extra bin guarantees that the breaks cover the largest value)
  bins <- seq(min, max + binsize, binsize)
  hist(forecastErrors, col="red", freq=FALSE, breaks=bins)
  myHist <- hist(norm, plot=FALSE, breaks=bins)

  # Overlay the Blue normal curve on top of forecastErrors histogram
  points(myHist$mids, myHist$density, type="l", col="blue", lwd=2)
}

displayForecastErrors(residuals(fit24))


19.1.6  Step  4:  Forecasting  with  ARIMA  Model 

Now, we can use our models to make predictions for future PM2.5 AQI. We will use the function forecast() to make predictions. In this function, we have to specify the number of periods we want to forecast. Using the smoothed data, we can make predictions for the next month, July 2016. As each month has about 24 × 30 = 720 h, we specify a horizon h = 720 (Fig. 19.13).

par(mfrow=c(1, 1))
ts.forecasts<-forecast(fit, h=720)
plot(ts.forecasts, include = 2880)

When plotting the forecasted values with the original smoothed data, we include only the last four months (include = 2880 hourly observations) of the original smoothed data to see the predicted values more clearly. The shaded regions indicate ranges of expected errors. The darker (inner) region represents the 80% prediction interval and the lighter (outer) region bounds the 95% interval. Naturally, near-term forecasts have tighter ranges of expected errors than longer-term forecasts, where the variability expands. A live demo of US Census data is shown in Fig. 19.14.

Fig. 19.13 Prospective out-of-range prediction intervals of the ARIMA(1,1,4) time-series model (panel title: Forecasts from ARIMA(1,1,4))

Fig. 19.14 Live Demo: Interactive US Census ARIMA modeling (Seasonal Web-demo, an R-interface to X-13ARIMA-SEATS, the seasonal adjustment software developed by the United States Census Bureau, http://www.seasonal.website/)
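
The widening of the intervals can be reproduced with base R's arima() and predict(), which return point forecasts and their standard errors. This is a sketch on a simulated ARIMA(1,1,0) series (coefficient, horizon, and seed are assumptions); the 80% and 95% bands are built from normal quantiles.

```r
set.seed(99)
y <- arima.sim(list(order = c(1, 1, 0), ar = 0.7), n = 500)   # simulated series
fit_demo <- arima(y, order = c(1, 1, 0))

fc <- predict(fit_demo, n.ahead = 48)   # list with $pred and $se

z80 <- qnorm(0.90)    # two-sided 80% interval
z95 <- qnorm(0.975)   # two-sided 95% interval
bands <- cbind(lower95 = fc$pred - z95 * fc$se,
               lower80 = fc$pred - z80 * fc$se,
               point   = fc$pred,
               upper80 = fc$pred + z80 * fc$se,
               upper95 = fc$pred + z95 * fc$se)

# Width of the 95% interval at horizons 1, 12, and 48
width95 <- bands[, "upper95"] - bands[, "lower95"]
round(width95[c(1, 12, 48)], 2)
```

For an integrated (d = 1) model the forecast standard error grows with the horizon, which is why the outer band fans out in plots such as Fig. 19.13.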


19.2  Structural  Equation  Modeling  (SEM)-Latent 
Variables 

Timeseries analyses provide effective strategies to interrogate longitudinal univariate data. What happens if we have multiple, potentially associated, measurements recorded at each time point?

SEM is a general multivariate statistical analysis technique that can be used for causal modeling/inference, path analysis, confirmatory factor analysis (CFA), covariance structure modeling, and correlation structure modeling. This method allows separation of observed and latent variables. Other standard statistical procedures may be viewed as special cases of SEM. In SEM, statistical significance may be less important, and covariances are the core of structural equation models.

Latent variables are features that are not directly observed but may be inferred from the actually observed variables. In other words, a combination or transformation of observed variables can create latent features, which may help us reduce the dimensionality of data. Also, SEM can address multi-collinearity issues when we fit models because we can combine some highly collinear variables to create a single (latent) variable, which can then be included into the model.


19.2.1  Foundations  of  SEM 

SEMs consist of two complementary components: (1) a path model, quantifying specific cause-and-effect relationships between observed variables, and (2) a measurement model, quantifying latent linkages between unobservable components and observed variables. The LISREL (Linear Structural RELations) framework represents a unifying mathematical strategy to specify these linkages, see Grace 2006.

The most general kind of SEM is a structural regression path model with latent variables, which account for measurement errors of observed variables. Model identification determines whether the model allows for unique parameter estimates and may be based on model degrees of freedom ($df_M \geq 0$) or a known scale for every latent feature. If $v$ represents the number of observed variables, then the total degrees of freedom for a SEM, $\frac{v(v+1)}{2}$, corresponds to the number of variances and unique covariances in a variance-covariance matrix for all the features, and the model degrees of freedom are $df_M = \frac{v(v+1)}{2} - l$, where $l$ is the number of estimated parameters.

Examples  include: 

•  Just-identified  model  ( dfM  =  0)  with  unique  parameter  estimates, 

•  Over-identified  model  ( dfM  >  0)  desirable  for  model  testing  and  assessment, 

•  Under-identified model ( dfM  <  0)  does not guarantee unique solutions for all
parameters.  In  practice,  such  models  occur  when  the  effective  degrees  of  freedom
are  reduced  due  to  two  or  more  highly-correlated  features,  which  presents
problems  with  parameter  estimation.  In  these  situations,  we  can  exclude  or
combine  some  of  the  features,  boosting  the  degrees  of  freedom.
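
The degrees-of-freedom bookkeeping above is simple arithmetic. The helpers sem_df() and identification() below are hypothetical names introduced only for illustration, and the counts of observed variables and free parameters are made-up examples (covariance structure only, without a mean structure).

```r
# Hypothetical helper: total and model degrees of freedom for v observed
# variables and l estimated parameters
sem_df <- function(v, l) {
  total_df <- v * (v + 1) / 2          # variances plus unique covariances
  c(total = total_df, model = total_df - l)
}

# Hypothetical helper: classify a model by its degrees of freedom
identification <- function(df_m) {
  if (df_m > 0) "over-identified"
  else if (df_m == 0) "just-identified"
  else "under-identified"
}

res <- sem_df(v = 10, l = 22)
res                                        # total = 55, model = 33
identification(res["model"])               # over-identified
identification(sem_df(4, 10)["model"])     # just-identified (10 - 10 = 0)
identification(sem_df(4, 12)["model"])     # under-identified (10 - 12 = -2)
```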

The  latent  variables’  scale  property  reflects  their  unobservable,  not  measurable, 
characteristics.  The  latent  scale,  or  unit,  may  be  inferred  from  one  of  its  observed 
constituent  variables,  e.g.,  by  imposing  a  unit  loading  identification  constraint  fixing 
at  1.0  the  factor  loading  of  one  observed  variable. 

An SEM model with appropriate scale and degrees of freedom conditions may be identifiable subject to Bollen’s two-step identification rule. When both the CFA and the path components of the SEM model are identifiable, then the whole structural regression (SR) model is identified, and model fitting can be initiated.

• For the confirmatory factor analysis (CFA) part of the SEM, identification requires (1) a minimum of two observed variables for each latent feature, (2) independence between measurement errors and the latent variables, and (3) independence between measurement errors.

• For the path component of the SEM, ignoring any observed variables used to measure latent variables, model identification requires: (1) errors associated with endogenous latent variables to be uncorrelated, and (2) all causal effects to be unidirectional.

The LISREL representation can be summarized by the following matrix equations:

measurement model component

$$x = \Lambda_x \xi + \delta,$$

$$y = \Lambda_y \eta + \epsilon,$$

and

path model component

$$\eta = B\eta + \Gamma\xi + \zeta,$$

where:

• $x_{p \times 1}$ is a vector of observed exogenous variables representing a linear function of $\xi_{j \times 1}$, a vector of exogenous latent variables,

• $\delta_{p \times 1}$ is a vector of measurement error, $\Lambda_x$ is a $p \times j$ matrix of factor loadings relating $x$ to $\xi$,

• $y_{q \times 1}$ is a vector of observed endogenous variables,

• $\eta_{k \times 1}$ is a vector of endogenous latent variables,

• $\epsilon_{q \times 1}$ is a vector of measurement error for the endogenous variables, and

• $\Lambda_y$ is a $q \times k$ matrix of factor loadings relating $y$ to $\eta$.

Let’s also denote the two variance-covariance matrices, $\Theta_\delta$ ($p \times p$) and $\Theta_\epsilon$ ($q \times q$), representing the variance-covariance matrices among the measurement errors $\delta$ and $\epsilon$, respectively. The third equation, describing the LISREL path model component as relationships among latent variables, includes:

• $B_{k \times k}$, a matrix of path coefficients describing the relationships among endogenous latent variables,

• $\Gamma_{k \times j}$, a matrix of path coefficients representing the linear effects of exogenous variables on endogenous variables,

• $\zeta_{k \times 1}$, a vector of errors of endogenous variables, and the corresponding two variance-covariance matrices, $\Phi_{j \times j}$ of the latent exogenous variables, and

• $\Psi_{k \times k}$ of the errors of endogenous variables.

The basic statistic for a typical SEM implementation is based on covariance structure modeling, and model fitting relies on optimizing an objective function, $\min f(\Sigma, S)$, representing the difference between the model-implied variance-covariance matrix, $\Sigma$, predicted from the causal and non-causal associations specified in the model, and the corresponding observed variance-covariance matrix, $S$, which is estimated from observed data. The objective function, $f(\Sigma, S)$, can be estimated as shown below, see Shipley 2016.

In general, causation implies correlation, suggesting that if there is a causal relationship between two variables, there must also be a systematic relationship between them. Specifying a set of theoretical causal paths, we can reconstruct the model-implied variance-covariance matrix, $\Sigma$, from total effects and unanalyzed associations. The LISREL strategy specifies the following mathematical representation:

$$\Sigma = \begin{pmatrix} \Lambda_y A (\Gamma \Phi \Gamma' + \Psi) A' \Lambda_y' + \Theta_\epsilon & \Lambda_y A \Gamma \Phi \Lambda_x' \\ \Lambda_x \Phi \Gamma' A' \Lambda_y' & \Lambda_x \Phi \Lambda_x' + \Theta_\delta \end{pmatrix},$$

where $A = (I - B)^{-1}$. This representation of $\Sigma$ does not involve the observed and latent exogenous and endogenous variables, $x$, $y$, $\xi$, $\eta$. Maximum likelihood estimation (MLE) may be used to obtain the $\Sigma$ parameters via iterative searches for a set of optimal parameters minimizing the element-wise deviations between $\Sigma$ and $S$.
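
The block formula for the model-implied variance-covariance matrix can be evaluated directly. The sketch below assumes a deliberately tiny model with one exogenous and one endogenous latent variable, each measured by two indicators; all numeric parameter values are made up for illustration.

```r
# Assumed toy parameters (j = 1, k = 1, p = 2, q = 2); values are illustrative only
Lambda_x <- matrix(c(1, 0.8), ncol = 1)    # p x j loadings of x on xi
Lambda_y <- matrix(c(1, 0.9), ncol = 1)    # q x k loadings of y on eta
B     <- matrix(0,    1, 1)                # k x k paths among endogenous latents
Gamma <- matrix(0.5,  1, 1)                # k x j effects of xi on eta
Phi   <- matrix(1,    1, 1)                # j x j covariance of xi
Psi   <- matrix(0.75, 1, 1)                # k x k covariance of zeta
Theta_delta   <- diag(c(0.3, 0.4))         # p x p measurement-error covariance
Theta_epsilon <- diag(c(0.2, 0.5))         # q x q measurement-error covariance

A <- solve(diag(nrow(B)) - B)              # A = (I - B)^{-1}

# Blocks of the model-implied covariance matrix, ordered as (y, x)
S_yy <- Lambda_y %*% A %*% (Gamma %*% Phi %*% t(Gamma) + Psi) %*% t(A) %*% t(Lambda_y) + Theta_epsilon
S_yx <- Lambda_y %*% A %*% Gamma %*% Phi %*% t(Lambda_x)
S_xx <- Lambda_x %*% Phi %*% t(Lambda_x) + Theta_delta

Sigma <- rbind(cbind(S_yy, S_yx), cbind(t(S_yx), S_xx))
dimnames(Sigma) <- list(c("y1", "y2", "x1", "x2"), c("y1", "y2", "x1", "x2"))
Sigma
```

The result depends only on the parameter matrices, not on any observed data, which is what allows it to be compared against the sample matrix S during fitting.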

The process of optimizing the objective function $f(\Sigma, S)$ can be achieved by computing the log likelihood ratio, i.e., comparing the likelihood of a given fitted model to the likelihood of a perfectly fit model. MLE estimation requires a multivariate normal distribution for the endogenous variables and a Wishart distribution for the observed variance-covariance matrix, $S$.

Using MLE estimation simplifies the objective function to:

$$f(\Sigma, S) = \ln|\Sigma| + tr(S \times \Sigma^{-1}) - \ln|S| - tr(S S^{-1}),$$

where $tr()$ is the trace of a matrix. The optimization of $f(\Sigma, S)$ also requires independent and identically distributed observations and positive definite matrices, $\Sigma$, $S$. The iterative MLE optimization generates estimated variance-covariance matrices and path coefficients for the specified model. More details on model assessment (using Root Mean Square Error of Approximation, RMSEA, and Goodness of Fit Index) and the process of defining a priori SEM hypotheses are available in Lam & Maguire, 2012.
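
As an illustration of the ML objective, the function f_ml() below (a hypothetical name) evaluates the discrepancy for a given pair of matrices, using the fact that tr(S S^{-1}) equals the number of observed variables. The mtcars columns are an arbitrary choice of example data, and the "model" is simply a diagonal matrix that ignores all covariances.

```r
# Hypothetical helper: ML discrepancy between a model-implied matrix Sigma
# and an observed covariance matrix S (both assumed positive definite)
f_ml <- function(Sigma, S) {
  logdet_Sigma <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  logdet_S     <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  logdet_Sigma + sum(diag(S %*% solve(Sigma))) - logdet_S - nrow(S)
}

vars <- c("mpg", "hp", "wt")
S <- cov(mtcars[, vars])          # observed covariance matrix

f_perfect <- f_ml(S, S)           # a perfectly fitting model gives (numerically) zero
Sigma_indep <- diag(diag(S))      # model assuming uncorrelated variables
f_indep <- f_ml(Sigma_indep, S)   # strictly positive discrepancy

c(perfect = f_perfect, independence = f_indep)
```

For the independence model the discrepancy reduces to minus the log-determinant of the correlation matrix, so the stronger the correlations that the model ignores, the larger the value.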


19.2.2  SEM  Components 

The R lavaan package uses the following SEM syntax, Table 19.1, to represent relationships between variables. We can follow this table to specify lavaan models:

Table 19.1 lavaan syntax for specifying the relations between variables and their variance-covariance structure

Formula type                  Operator   Explanation
Latent variable definition    =~         Is measured by
Regression                    ~          Is regressed on
(Residual) (co)variance       ~~         Is correlated with
Intercept                     ~ 1        Intercept

For example, in R we can write the following model:

model <- '
  # regressions
  y1 + y2 ~ f1 + f2 + x1 + x2
  f1 ~ f2 + f3
  f2 ~ f3 + x1 + x2

  # latent variable definitions
  f1 =~ y1 + y2 + y3
  f2 =~ y4 + y5 + y6
  f3 =~ y7 + y8 + y9 + y10

  # variances and covariances
  y1 ~~ y1
  y1 ~~ y2
  f1 ~~ f2

  # intercepts
  y1 ~ 1
  f1 ~ 1
'

Note that the two single-quote symbols, ', (at the beginning and the end of a model description) are very important in the R syntax.

19.2.3 Case Study - Parkinson’s Disease (PD)

Let’s  use  the  PPMI  dataset  in  our  class  file  as  an  example  to  illustrate  SEM  model 
fitting. 


Step  1  -  Collecting  Data 

The  Parkinson’s  Disease  Data  represents  a  realistic  simulation  case-study  to  examine 
associations  between  clinical,  demographic,  imaging  and  genetics  variables  for 
Parkinson’s  disease.  This  is  an  example  of  Big  Data  for  investigating  important 
neurodegenerative  disorders. 


Step  2  -  Exploring  and  Preparing  the  Data 


Now,  we  can  import  the  dataset  into  R  and  recode  the  ResearchGroup  variable 
into  a  binary  variable. 

par(mfrow=c(1, 1))
PPMI<-read.csv("https://umich.instructure.com/files/330397/download?download_frd=1")
summary(PPMI)

##     FID_IID     L_insular_cortex_ComputeArea L_insular_cortex_Volume
##  Min.   :3001   Min.   :  50.03              Min.   :   22.63
##  1st Qu.:3272   1st Qu.:1976.88              1st Qu.: 4881.36
##  Median :3476   Median :2498.65              Median : 7236.76
##  Mean   :3534   Mean   :2255.20              Mean   : 6490.84
##  3rd Qu.:3817   3rd Qu.:2744.05              3rd Qu.: 8405.43
##  Max.   :4139   Max.   :3650.81              Max.   :13499.92
##   UPDRS_part_I     UPDRS_part_II    UPDRS_part_III    time_visit
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 0.000   1st Qu.: 2.000   1st Qu.:12.00   1st Qu.: 8.25
##  Median : 1.000   Median : 5.000   Median :20.00   Median :21.00
##  Mean   : 1.286   Mean   : 6.087   Mean   :19.44   Mean   :23.50
##  3rd Qu.: 2.000   3rd Qu.: 9.000   3rd Qu.:27.00   3rd Qu.:37.50
##  Max.   :13.000   Max.   :28.000   Max.   :61.00   Max.   :54.00
##  NA's   :549      NA's   :553      NA's   :554

PPMI$ResearchGroup<-ifelse(PPMI$ResearchGroup=="Control", "1", "0")

Fig. 19.15 Pair-wise correlation structure of the Parkinson’s disease (PPMI) data.


This large dataset has 1,764 observations and 31 variables, with missing data in some of them. Many of the variables are highly correlated. You can inspect high correlations using heat maps, which reorder the covariates according to their correlations to illustrate clusters of high correlations (Fig. 19.15).

library(reshape2)   # provides melt()
library(ggplot2)    # provides ggplot() and the theme helpers

pp_heat <- PPMI[complete.cases(PPMI), -20]
corr_mat = cor(pp_heat)

# Remove upper triangle
corr_mat_lower = corr_mat
corr_mat_lower[upper.tri(corr_mat_lower)] = NA

# Melt correlation matrix and make sure order of factor variables is correct
corr_mat_melted = melt(corr_mat_lower)
colnames(corr_mat_melted) <- c("Var1", "Var2", "value")
corr_mat_melted$Var1 = factor(corr_mat_melted$Var1, levels=colnames(corr_mat))
corr_mat_melted$Var2 = factor(corr_mat_melted$Var2, levels=colnames(corr_mat))

# Plot
corr_plot = ggplot(corr_mat_melted, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile(color='white') +
  scale_fill_distiller(limits=c(-1, 1), palette='RdBu', na.value='white',
                       name='Correlation') +
  ggtitle('Correlations') +
  coord_fixed(ratio=1) +
  theme_minimal() +
  scale_y_discrete(position="right") +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        panel.grid.major=element_blank(),
        legend.position=c(0.1, 0.9),
        legend.justification=c(0, 1))
corr_plot

And here are some specific correlations:

cor(PPMI$L_insular_cortex_ComputeArea, PPMI$L_insular_cortex_Volume)

## [1] 0.9837297

cor(PPMI$UPDRS_part_I, PPMI$UPDRS_part_II, use = "complete.obs")

## [1] 0.4027434

cor(PPMI$UPDRS_part_II, PPMI$UPDRS_part_III, use = "complete.obs")

## [1] 0.5326681
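
Instead of checking pairs one at a time, the lower triangle of the correlation matrix can be scanned for all highly correlated pairs. The sketch below uses four columns of the built-in mtcars data as a stand-in for the PPMI variables; the 0.8 cut-off is an arbitrary assumption.

```r
# Stand-in data: four numeric columns of the built-in mtcars dataset
cm <- cor(mtcars[, c("mpg", "cyl", "disp", "hp")])

# Keep the lower triangle (and diagonal), as in the heat map preparation above
cm_lower <- cm
cm_lower[upper.tri(cm_lower)] <- NA

# List each pair once (strict lower triangle) when |r| exceeds the cut-off
idx <- which(lower.tri(cm) & abs(cm) > 0.8, arr.ind = TRUE)
high <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                   var2 = colnames(cm)[idx[, "col"]],
                   r    = cm[idx])
high
```

Pairs flagged this way are natural candidates for being combined into a single latent variable, which is the strategy followed next.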

One  way  to  solve  this  substantial  multivariate  correlation  issue  is  to  create  some 
latent  variables.  We  can  consider  the  following  model. 

model1<-
'
Imaging =~ L_cingulate_gyrus_ComputeArea + L_cingulate_gyrus_Volume+R_cingulate_gyrus_ComputeArea+R_cingulate_gyrus_Volume+R_insular_cortex_ComputeArea+R_insular_cortex_Volume
UPDRS=~UPDRS_part_I+UPDRS_part_II+UPDRS_part_III
DemoGeno =~ Weight+Sex+Age
ResearchGroup ~ Imaging + DemoGeno + UPDRS
'

Here we try to create three latent variables: Imaging, DemoGeno, and UPDRS. Let’s fit a SEM model using cfa(), a confirmatory factor analysis function. Before fitting the data, we need to scale them. However, we don’t need to scale our binary response variable. We can use the following code for normalizing the data.

mydata<-scale(PPMI[, -20])
mydata<-data.frame(mydata, PPMI$ResearchGroup)
colnames(mydata)[31]<-"ResearchGroup"
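
The effect of this normalization step can be verified on a small simulated data frame (the column names, sizes, and seed below are assumptions): the continuous columns end up with mean 0 and standard deviation 1, while the binary response is carried over unchanged.

```r
set.seed(1)
# Simulated stand-in for two imaging measures and a binary group label
df <- data.frame(area   = rnorm(50, mean = 2000, sd = 500),
                 volume = rnorm(50, mean = 6500, sd = 2000),
                 group  = rbinom(50, size = 1, prob = 0.4))

# Scale only the continuous predictors, then re-attach the unscaled response
scaled <- as.data.frame(scale(df[, c("area", "volume")]))
scaled$group <- df$group

round(colMeans(scaled[, c("area", "volume")]), 10)   # both means are 0
apply(scaled[, c("area", "volume")], 2, sd)          # both standard deviations are 1
table(scaled$group)                                  # response left as 0/1
```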
Step 3 - Fitting a Model on the Data

Now, we can start to build the model. The cfa() function we will use is part of the lavaan package.

# install.packages("lavaan")
library(lavaan)
fit<-cfa(model1, data=mydata, missing = 'FIML')

Here we can see some warning messages. Both our covariance and error term matrices are not positive definite. Non-positive definite matrices can cause the estimates of our model to be biased. There are many factors that can lead to this problem. In this case, we may have created some latent variables that are not a good fit for our data. Let’s try to delete the DemoGeno latent variable. We can add Weight, Sex, and Age directly to the regression model.


model2 <-
'
# (1) Measurement Model
Imaging =~ L_cingulate_gyrus_ComputeArea + L_cingulate_gyrus_Volume+R_cingulate_gyrus_ComputeArea+R_cingulate_gyrus_Volume+R_insular_cortex_ComputeArea+R_insular_cortex_Volume
UPDRS =~ UPDRS_part_I +UPDRS_part_II + UPDRS_part_III

# (2) Regressions
ResearchGroup ~ Imaging + UPDRS +Age+Sex+Weight
'


When fitting model2, the warning messages are gone. We can see that falsely adding a latent variable can cause those matrices to be not positive definite. Currently, the lavaan functions sem() and cfa() are the same.

fit<-cfa(model2, data=mydata, missing = 'FIML')
summary(fit, fit.measures=TRUE)

## lavaan (0.5-23.1097) converged normally after 107 iterations
##
##   Number of observations                          1764
##
##   Number of missing patterns                         4
##
##   Estimator                                         ML
##   Minimum Function Test Statistic             7714.119
##   Degrees of freedom                                60
##   P-value (Chi-square)                           0.000
##
## Model test baseline model:
##
##   Minimum Function Test Statistic            30237.866
##   Degrees of freedom                                75
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    0.746
##   Tucker-Lewis Index (TLI)                       0.683
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)                     NA
##   Loglikelihood unrestricted model (H1)             NA
##
##   Number of free parameters                         35
##   Akaike (AIC)                                      NA
##   Bayesian (BIC)                                    NA
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.269
##   90 Percent Confidence Interval          0.264  0.274
##   P-value RMSEA <= 0.05                          0.000
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.052
##
## Parameter Estimates:
##
##   Information                                 Observed
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Imaging =~
##     L_cnglt_gyr_CA    1.000
##     L_cnglt_gyrs_V    0.994    0.004  260.366    0.000
##     R_cnglt_gyr_CA    0.961    0.007  134.531    0.000
##     R_cnglt_gyrs_V    0.955    0.008  126.207    0.000
##     R_nslr_crtx_CA    0.930    0.009  101.427    0.000
##     R_nslr_crtx_Vl    0.920    0.010   94.505    0.000
##   UPDRS =~
##     UPDRS_part_I      1.000
##     UPDRS_part_II     1.890    0.177   10.699    0.000
##     UPDRS_part_III    2.345    0.248    9.468    0.000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ResearchGroup ~
##     Imaging           0.008    0.010    0.788    0.431
##     UPDRS            -0.828    0.080  -10.299    0.000
##     Age               0.019    0.009    2.121    0.034
##     Sex              -0.010    0.010   -0.974    0.330
##     Weight            0.005    0.010    0.481    0.631
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Imaging ~~
##     UPDRS             0.059    0.014    4.361    0.000
##
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .L_cnglt_gyr_CA   -0.000    0.024   -0.001    1.000
##    .L_cnglt_gyrs_V   -0.000    0.024   -0.001    1.000
##    .R_cnglt_gyr_CA   -0.000    0.024   -0.001    1.000
##    .R_cnglt_gyrs_V   -0.000    0.024   -0.001    1.000
##    .R_nslr_crtx_CA   -0.000    0.024   -0.001    1.000
##    .R_nslr_crtx_Vl   -0.000    0.024   -0.001    1.000
##    .UPDRS_part_I     -0.135    0.032   -4.225    0.000
##    .UPDRS_part_II    -0.255    0.033   -7.621    0.000
##    .UPDRS_part_III   -0.317    0.034   -9.181    0.000
##    .ResearchGroup     1.290    0.011  119.239    0.000
##     Imaging           0.000
##     UPDRS             0.000
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .L_cnglt_gyr_CA    0.006    0.001    9.641    0.000
##    .L_cnglt_gyrs_V    0.019    0.001   23.038    0.000
##    .R_cnglt_gyr_CA    0.083    0.003   27.917    0.000
##    .R_cnglt_gyrs_V    0.093    0.003   27.508    0.000
##    .R_nslr_crtx_CA    0.141    0.005   28.750    0.000
##    .R_nslr_crtx_Vl    0.159    0.006   28.728    0.000
##    .UPDRS_part_I      0.877    0.038   23.186    0.000
##    .UPDRS_part_II     0.561    0.033   16.873    0.000
##    .UPDRS_part_III    0.325    0.036    9.146    0.000
##    .ResearchGroup     0.083    0.006   14.808    0.000
##     Imaging           0.993    0.034   29.509    0.000
##     UPDRS             0.182    0.035    5.213    0.000


19.2.4  Outputs  of  Lavaan  SEM 

In  the  output  of  our  model,  we  have  information  about  how  to  create  these  two  latent 
variables  (Imaging,  UPDRS)  and  the  estimated  regression  model.  Specifically,  it 
gives  the  following  information. 

1. The first six lines, called the header, contain the following information:

• Lavaan version number.

• Lavaan convergence information (normal or not), and the number of iterations needed.

• The number of observations that were effectively used in the analysis.

• The estimator that was used to obtain the parameter values (here: ML).

• The model test statistic, the degrees of freedom, and a corresponding p-value.

2. Next, we have the Model test baseline model and the value for the SRMR.

3. The last section contains the parameter estimates and their standard errors. It first indicates how the standard errors were computed (whether the information matrix is expected or observed, and whether the standard errors are standard, robust, or based on the bootstrap). Then, it tabulates all free (and fixed) parameters that were included in the model. Typically, the latent variables are shown first, followed by covariances and (residual) variances. The first column (Estimate) contains the (estimated or fixed) parameter value for each model parameter; the second column (Std.Err) contains the standard error for each estimated parameter; the third column (z-value) contains the Wald statistic (which is simply obtained by dividing the parameter value by its standard error); and the last column contains the p-value for testing the null hypothesis that the parameter equals zero in the population.
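
As a quick check of how the last two columns relate to the first two, the following minimal sketch recomputes a Wald statistic and its two-sided p-value in base R. The numbers are the rounded Estimate and Std.Err printed for the Imaging variance above; the variable names (estimate, std_err, z_value, p_value) are illustrative and are not part of lavaan.

```r
# Rounded values copied from the printed Variances table (Imaging row)
estimate <- 0.993
std_err  <- 0.034

# Wald statistic: parameter estimate divided by its standard error
z_value <- estimate / std_err

# Two-sided p-value under a standard normal reference distribution
p_value <- 2 * pnorm(-abs(z_value))

z_value
p_value
```

Because the printed inputs are rounded to three decimals, the recomputed statistic only approximately matches the reported z-value (29.509).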


19.3  Longitudinal  Data  Analysis-Linear  Mixed  Models 

As mentioned earlier, longitudinal studies take measurements for the same individual repeatedly over a period of time. Under this setting, we can measure the change after a specific treatment. However, the measurements for the same individual may be correlated with each other. Thus, we need special models that deal with this type of internal multivariate dependency.

If we use the latent variable UPDRS (created in the output of the SEM model), rather than the research group, as our response, we can obtain a longitudinal analysis model. In longitudinal analysis, time is often an important model variable.


19.3.1  Mean  Trend 

According to the output of the model fit, our latent variable UPDRS is a combination of three observed variables: UPDRS_part_I, UPDRS_part_II, and UPDRS_part_III. We can visualize how average UPDRS values differ among the research groups over time.

mydata$UPDRS <- mydata$UPDRS_part_I + 1.890*mydata$UPDRS_part_II + 2.345*mydata$UPDRS_part_III

mydata$Imaging <- mydata$L_cingulate_gyrus_ComputeArea + 0.994*mydata$L_cingulate_gyrus_Volume + 0.961*mydata$R_cingulate_gyrus_ComputeArea + 0.955*mydata$R_cingulate_gyrus_Volume + 0.930*mydata$R_insular_cortex_ComputeArea + 0.920*mydata$R_insular_cortex_Volume

The above code stores the latent UPDRS and Imaging variables into mydata. By now, we are experienced with using the package ggplot2 for data visualization. Now, we will use it to set the x and y axes as time and UPDRS, and then display the trend of the individual-level UPDRS.
Fig. 19.16 Average UPDRS scores of the two cohorts in the PPMI dataset, patients (0) and controls (1)


require(ggplot2)

p <- ggplot(data=mydata, aes(x=time_visit, y=UPDRS, group=FID_IID))
dev.off()

p + geom_point() + geom_line()

This graph is a bit messy, without a clear pattern emerging. Let's see if group-level graphs may provide more intuition. We will use the aggregate() function to get the mean, minimum, and maximum of UPDRS for each time point. Then, we will use separate colors for the two research groups and examine their mean trends (Fig. 19.16).

ppmi.mean <- aggregate(UPDRS~time_visit+ResearchGroup, FUN = mean, data=mydata[, c(30, 31, 32)])

ppmi.min <- aggregate(UPDRS~time_visit+ResearchGroup, FUN = min, data=mydata[, c(30, 31, 32)])

ppmi.max <- aggregate(UPDRS~time_visit+ResearchGroup, FUN = max, data=mydata[, c(30, 31, 32)])

ppmi.boundary <- merge(ppmi.min, ppmi.max, by=c("time_visit", "ResearchGroup"))
ppmi.all <- merge(ppmi.mean, ppmi.boundary, by=c("time_visit", "ResearchGroup"))
pd <- position_dodge(0.1)

p1 <- ggplot(data=ppmi.all, aes(x=time_visit, y=UPDRS, group=ResearchGroup, colour=ResearchGroup))

p1 + geom_errorbar(aes(ymin=UPDRS.x, ymax=UPDRS.y, width=0.1)) + geom_point() + geom_line()
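
To see why the error bars refer to UPDRS.x and UPDRS.y, the following minimal base-R sketch repeats the aggregate-and-merge steps on a small simulated data frame. The data frame toy and its values are hypothetical stand-ins for mydata; only the column names mirror the ones used above.

```r
set.seed(10)
# Hypothetical toy data: 2 groups x 3 visits x 10 observations per cell
toy <- data.frame(
  time_visit    = rep(c(0, 1, 2), times = 20),
  ResearchGroup = factor(rep(c("0", "1"), each = 30)),
  UPDRS         = rnorm(60, mean = rep(c(8, 2), each = 30))
)

# Group-level mean, minimum, and maximum per visit and research group
toy.mean <- aggregate(UPDRS ~ time_visit + ResearchGroup, FUN = mean, data = toy)
toy.min  <- aggregate(UPDRS ~ time_visit + ResearchGroup, FUN = min,  data = toy)
toy.max  <- aggregate(UPDRS ~ time_visit + ResearchGroup, FUN = max,  data = toy)

# merge() disambiguates the two clashing UPDRS columns with suffixes:
# UPDRS.x comes from the first argument (min), UPDRS.y from the second (max)
toy.boundary <- merge(toy.min, toy.max, by = c("time_visit", "ResearchGroup"))
toy.all      <- merge(toy.mean, toy.boundary, by = c("time_visit", "ResearchGroup"))

names(toy.all)
```

In the merged result, UPDRS holds the group mean, while UPDRS.x and UPDRS.y hold the lower and upper ends of the range drawn by geom_errorbar().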
Despite slight overlaps in some lines, the resulting graph better illustrates the mean differences between the two cohorts. The control group (1) appears to have relatively lower means and tighter ranges compared to the PD patient group (0). However, we need further data interrogation to determine if this visual (EDA) evidence translates into statistically significant group differences.

Generally speaking, we can always use the General Linear Modeling (GLM) framework. However, GLM may ignore the individual differences. So, we can try to fit a Linear Mixed Model (LMM) to incorporate different intercepts for each individual participant. Consider the following GLM:

$$UPDRS_{ij} \sim \beta_0 + \beta_1 * Imaging_{ij} + \beta_2 * ResearchGroup_i + \beta_3 * timeVisit_j + \beta_4 * ResearchGroup_i * timeVisit_j + \beta_5 * Age_i + \beta_6 * Sex_i + \beta_7 * Weight_i + \epsilon_{ij}.$$

If we fit a different intercept, $b_i$, for each individual (indicated by FID_IID), we obtain the following LMM model:

$$UPDRS_{ij} \sim \beta_0 + \beta_1 * Imaging_{ij} + \beta_2 * ResearchGroup_i + \beta_3 * timeVisit_j + \beta_4 * ResearchGroup_i * timeVisit_j + \beta_5 * Age_i + \beta_6 * Sex_i + \beta_7 * Weight_i + b_i + \epsilon_{ij}.$$

The LMM actually has two levels:

Stage 1

$$Y_i = Z_i \beta_i + \epsilon_i,$$

where both $Z_i$ and $\beta_i$ are matrices.

Stage 2

The second level allows fitting random effects in the model.

$$\beta_i = A_i * \beta + b_i.$$

So, substituting Stage 2 into Stage 1 and writing $X_i = Z_i A_i$, the full model in matrix form would be:

$$Y_i = X_i * \beta + Z_i * b_i + \epsilon_i.$$

In this case study, we only consider a random intercept and avoid including random slopes; however, the model can indeed be extended. In other words, $Z_i = 1$ in our simple model. Let's compare the two models (GLM and LMM). One R package implementing LMM is lme4.


#install.packages("lme4")
#install.packages("arm")
library(lme4)
library(arm)

#GLM
model.glm <- glm(UPDRS~Imaging+ResearchGroup*time_visit+Age+Sex+Weight, data=mydata)
summary(model.glm)
##
## Call:
## glm(formula = UPDRS ~ Imaging + ResearchGroup * time_visit +
##     Age + Sex + Weight, data = mydata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -7.6065  -2.4581  -0.3159   1.8328  14.9746
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.70000    0.10844   6.455 1.57e-10 ***
## Imaging                    0.03834    0.01893   2.025   0.0431 *
## ResearchGroup1            -6.93501    0.33445 -20.736  < 2e-16 ***
## time_visit                 0.05077    0.10843   0.468   0.6397
## Age                        0.54171    0.10839   4.998 6.66e-07 ***
## Sex                        0.16170    0.11967   1.351   0.1769
## Weight                     0.20980    0.11707   1.792   0.0734 .
## ResearchGroup1:time_visit -0.06842    0.32970  -0.208   0.8356
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 12.58436)
##
##     Null deviance: 21049  on 1205  degrees of freedom
## Residual deviance: 15076  on 1198  degrees of freedom
##   (558 observations deleted due to missingness)
## AIC: 6486.6
##
## Number of Fisher Scoring iterations: 2

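# LMM
# Note: (time_visit | FID_IID) requests a random intercept and a random
# time_visit slope for each participant; (1 | FID_IID) would fit a
# random intercept only.
model.lmm <- lmer(UPDRS~Imaging+ResearchGroup*time_visit+Age+Sex+Weight+(time_visit|FID_IID), data=mydata)
summary(model.lmm)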

## Linear mixed model fit by REML ['lmerMod']
## Formula:
## UPDRS ~ Imaging + ResearchGroup * time_visit + Age + Sex + Weight +
##     (time_visit | FID_IID)
##    Data: mydata
##
## REML criterion at convergence: 5737.9
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -3.2660 -0.4617 -0.0669  0.3575  4.6158
##
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  FID_IID  (Intercept) 7.8821   2.8075
##           time_visit  0.2454   0.4954   0.16
##  Residual             3.1233   1.7673
## Number of obs: 1206, groups:  FID_IID, 440
##
## Fixed effects:
##                           Estimate Std. Error t value
## (Intercept)                0.69803    0.16881   4.135
## Imaging                    0.04200    0.02669   1.574
## ResearchGroup1            -6.93136    0.34425 -20.135
## time_visit                 0.02799    0.06385   0.438
## Age                        0.47720    0.15065   3.168
## Sex                        0.18662    0.17212   1.084
## Weight                     0.24146    0.17075   1.414
## ResearchGroup1:time_visit -0.04785    0.30496  -0.157
##
## Correlation of Fixed Effects:
##             (Intr) Imagng RsrcG1 tm_vst Age    Sex    Weight
## Imaging     -0.059
## ReserchGrp1 -0.496  0.101
## time_visit   0.067 -0.002 -0.033
## Age         -0.028  0.128  0.045  0.002
## Sex         -0.029  0.014  0.048  0.006  0.140
## Weight      -0.015  0.046  0.022  0.006  0.125  0.522
## RsrchGrp1   -0.011 -0.053 -0.001 -0.209 -0.010 -0.005  0.000


display(model.lmm)

## lmer(formula = UPDRS ~ Imaging + ResearchGroup * time_visit +
##     Age + Sex + Weight + (time_visit | FID_IID), data = mydata)
##                           coef.est coef.se
## (Intercept)                0.70     0.17
## Imaging                    0.04     0.03
## ResearchGroup1            -6.93     0.34
## time_visit                 0.03     0.06
## Age                        0.48     0.15
## Sex                        0.19     0.17
## Weight                     0.24     0.17
## ResearchGroup1:time_visit -0.05     0.30
##
## Error terms:
##  Groups   Name        Std.Dev. Corr
##  FID_IID  (Intercept) 2.81
##           time_visit  0.50     0.16
##  Residual             1.77
## ---
## number of obs: 1206, groups: FID_IID, 440
## AIC = 5761.9, DIC = 5702.5
## deviance = 5720.2


Note that we use the notation ResearchGroup*time_visit, which is identical to ResearchGroup + time_visit + ResearchGroup:time_visit. Here R will include both terms and their interaction in the model. According to the model outputs, the LMM model has a smaller AIC (5761.9) than the GLM (6486.6). In terms of AIC, LMM may represent a better model fit than GLM.
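
The formula shorthand can be verified without fitting any model. This minimal sketch, using placeholder variable names (y, g, and t are not columns of mydata), asks base R to expand both formulas into their term labels.

```r
# Two ways of writing the same model: crossing (*) versus explicit terms
f_star  <- y ~ g * t
f_colon <- y ~ g + t + g:t

# terms() expands a formula; "term.labels" lists the resulting model terms
labels_star  <- attr(terms(f_star),  "term.labels")
labels_colon <- attr(terms(f_colon), "term.labels")

labels_star
identical(labels_star, labels_colon)
```

Both formulas expand to the two main effects plus the g:t interaction, which is why the model output above contains a ResearchGroup1:time_visit row.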


19.3.2  Modeling  the  Correlation 

In the summary of the LMM model, we can see a section called Correlation of Fixed Effects. The original model made no assumption about the correlation (unstructured correlation). In R, we usually have the following four types of correlation models.
• Independence: No correlation:

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

• Exchangeable: Correlations are constant across measurements:

$$\begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}.$$

• Autoregressive order 1 (AR(1)): Correlations are stronger for closer measurements and weaker for more distanced measurements:

$$\begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}.$$

• Unstructured: Correlation is different for each occasion:

$$\begin{pmatrix} 1 & \rho_{1,2} & \rho_{1,3} \\ \rho_{1,2} & 1 & \rho_{2,3} \\ \rho_{1,3} & \rho_{2,3} & 1 \end{pmatrix}.$$

In  the  LMM  model,  the  output  also  seems  unstructured.  So,  we  needn’t  worry 
about  changing  the  correlation  structure.  However,  if  the  output  under  unstructured 
correlation  assumption  looks  like  an  Exchangeable  or  AR(1)  structure,  we  may 
consider  changing  the  LMM  correlation  structure  accordingly. 
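
The first three structures are easy to build numerically, which can help when comparing them against an estimated correlation matrix. The sketch below uses base R only; the value rho = 0.5 and the number of occasions n = 3 are arbitrary illustrative choices.

```r
rho <- 0.5   # illustrative correlation parameter
n   <- 3     # number of repeated measurements per subject

# Independence: identity matrix
independence <- diag(n)

# Exchangeable: the same correlation between every pair of occasions
exchangeable <- matrix(rho, nrow = n, ncol = n)
diag(exchangeable) <- 1

# AR(1): correlation decays as rho^|j - k| with the lag between occasions
ar1 <- rho^abs(outer(1:n, 1:n, "-"))

independence
exchangeable
ar1
```

An unstructured matrix has no such generating rule; each off-diagonal entry is a separate parameter.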


19.4  GLMM/GEE  Longitudinal  Data  Analysis 

If the response is a binary variable like ResearchGroup, we need to use a Generalized Linear Mixed Model (GLMM) instead of LMM. The marginal-model analogue of GLMM is called GEE (generalized estimating equations). However, GLMM and GEE are actually different.

In situations where the responses are discrete, there may not be a uniform or systematic strategy for dealing with the joint multivariate distribution of $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{in})^T$. That's where the GEE method comes into play, as it's based on the concept of estimating equations. It provides a general approach for analyzing discrete and continuous responses with marginal models.

GEE  is  applicable  when: 

1. $\beta$, a generalized linear model regression parameter, characterizes systematic variation across covariate levels,

2. The data represents repeated measurements, clustered data, multivariate response, and

3. The correlation structure is a nuisance feature of the data.

Notation 

• Response variables: $\{Y_{i,1}, Y_{i,2}, \ldots, Y_{i,n_i}\}$, where $i \in [1, N]$ is the index for clusters or subjects, and $j \in [1, n_i]$ is the index of the measurement within cluster/subject.

• Covariate vector: $\{X_{i,1}, X_{i,2}, \ldots, X_{i,n_i}\}$.

The primary focus of GEE is the estimation of the mean model: $E(Y_{i,j} \mid X_{i,j}) = \mu_{i,j}$, where

$$g(\mu_{i,j}) = \beta_0 + \beta_1 X_{i,j}(1) + \beta_2 X_{i,j}(2) + \beta_3 X_{i,j}(3) + \cdots + \beta_p X_{i,j}(p) = X_{i,j} \times \beta.$$

This mean model can be any generalized linear model. For example: $P(Y_{i,j} = 1 \mid X_{i,j}) = \pi_{i,j}$ (marginal probability, as we don't condition on any other variables):

$$g(\mu_{i,j}) = \ln\left(\frac{\pi_{i,j}}{1 - \pi_{i,j}}\right) = X_{i,j} \times \beta.$$

Since  the  data  could  be  clustered  (e.g.,  within  subject,  or  within  unit),  we  need  to 
choose  a  correlation  model.  Let’s  introduce  some  notation: 


$$V_{i,j} = var(Y_{i,j} \mid X_i),$$

$$A_i = diag(V_{i,j}),$$

the paired correlations:

$$\rho_{i,j,k} = corr(Y_{i,j}, Y_{i,k} \mid X_i),$$

the correlation matrix:

$$R_i = (\rho_{i,j,k}), \text{ for all } j \text{ and } k,$$

and the resulting covariance matrix of the responses:

$$V_i = cov(Y_i \mid X_i) = A_i^{1/2} R_i A_i^{1/2}.$$

Assuming different correlation structures in the data leads to alternative models, see the examples above.

Notes 

• GEE is a semi-parametric technique because:

- The specification of a mean model, $\mu_{i,j}(\beta)$, and a correlation model, $R_i(\alpha)$, does not identify a complete probability model for $Y_i$.

- The model $\{\mu_{i,j}(\beta), R_i(\alpha)\}$ is semi-parametric since it only specifies the first two multivariate moments (mean and covariance) of $Y_i$. Higher order moments are not specified.

• Without an explicit likelihood function, to estimate the parameter vector $\beta$ (and perhaps the covariance parameter matrix $R_i(\alpha)$) and perform a valid statistical inference that takes the dependence into consideration, we need to construct an unbiased estimating function:

• $D_i(\beta) = \frac{\partial \mu_i}{\partial \beta}$, the partial derivative, w.r.t. $\beta$, of the mean-model for subject $i$.

• $D_i(j, k) = \frac{\partial \mu_{i,j}}{\partial \beta_k}$, the partial derivative, w.r.t. the kth regression coefficient ($\beta_k$), of the mean-model for subject $i$ and measurement (e.g., time-point) $j$.

Estimating (cost) function:

$$U(\beta) = \sum_{i=1}^{N} D_i^T(\beta) V_i^{-1}(\beta, \alpha) \{Y_i - \mu_i(\beta)\}.$$

Solving the Estimating Equations leads to parameter estimating solutions:

$$0 = U(\hat{\beta}) = \sum_{i=1}^{N} \underbrace{D_i^T(\hat{\beta})}_{\text{scale}} \underbrace{V_i^{-1}(\hat{\beta}, \alpha)}_{\text{variance weight}} \underbrace{\{Y_i - \mu_i(\hat{\beta})\}}_{\text{model mean}}.$$

Scale: A change of scale term transforming the scale of the mean, $\mu_i$, to the scale of the regression coefficients (covariates).

Variance weight: The inverse of the variance-covariance matrix is used to weight the data for subject $i$, i.e., giving more weight to differences between observed and expected values for subjects that contribute more information.

Model Mean: Specifies the mean model, $\mu_i(\beta)$, compared to the observed data, $Y_i$. This fidelity term minimizes the difference between actually-observed and mean-expected (within the ith cluster/subject). See also the SMHS EBook.
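
To make the estimating equation concrete, here is a minimal base-R sketch that solves U(beta) = 0 for a logistic mean model under the simplifying assumption of an independence working correlation (R_i equal to the identity). This is an illustration on simulated data, not the algorithm used by the gee package; all object names are hypothetical. Under these assumptions the GEE point estimates coincide with ordinary logistic regression, which the last line uses as a check.

```r
set.seed(1)
n_subj <- 50; n_obs <- 4
x <- rnorm(n_subj * n_obs)
# Subject-level shifts induce within-subject dependence in the simulated data
b <- rep(rnorm(n_subj, sd = 0.5), each = n_obs)
y <- rbinom(n_subj * n_obs, 1, plogis(-0.5 + 1.0 * x + b))
X <- cbind(1, x)                      # design matrix with intercept

beta <- c(0, 0)                       # starting values
for (iter in 1:25) {
  mu <- as.vector(plogis(X %*% beta)) # mean model mu_ij(beta), logit link
  v  <- mu * (1 - mu)                 # V_ij for a Bernoulli response
  # With the logit link D_i = diag(v) X_i, and with independence V_i = diag(v),
  # so D_i^T V_i^{-1} (Y_i - mu_i) reduces to X_i^T (Y_i - mu_i)
  U <- crossprod(X, y - mu)           # estimating function U(beta)
  H <- crossprod(X * v, X)            # sum_i D_i^T V_i^{-1} D_i
  step <- solve(H, U)
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break   # stop once U(beta) is (numerically) zero
}

glm_beta <- unname(coef(glm(y ~ x, family = binomial)))
cbind(gee_independence = beta, glm = glm_beta)
```

Only the point estimates are reproduced here; valid GEE inference additionally requires robust (sandwich) standard errors, and other working correlation structures change the weights V_i.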


19.4.1  GEE  Versus  GLMM 

There  is  a  difference  in  the  interpretation  of  the  model  coefficients  between  GEE  and 
GLMM.  The  fundamental  difference  between  GEE  and  GLMM  is  in  the  target  of  the 
inference: population-average vs. subject-specific. For instance, consider an example
where  the  observations  are  dichotomous  outcomes  (Y),  e.g.,  single  Bernoulli  trials  or 
death/survival  of  a  clinical  procedure,  that  are  grouped/clustered  into  hospitals  and 
units  within  hospitals,  with  N  additional  demographic,  phenotypic,  imaging  and 
genetics  predictors.  To  model  the  failure  rate  between  genders  (males  vs.  females)  in 
a  hospital,  where  all  patients  are  spread  among  different  hospital  units  (or  clinical 
teams),  let  Y  represent  the  binary  response  (death  or  survival). 
In GLMM, the model is quite similar to the LMM model:

$$\log\left(\frac{P(Y_{ij} = 1)}{P(Y_{ij} = 0)} \,\Big|\, X_{ij}, b_i\right) = \beta_0 + \beta_1 x_{ij} + b_i + \epsilon_{ij}.$$

The only difference between GLMM and LMM in this situation is that GLMM uses a logit link for the binary response.

With GEE, we don't have random intercept or slope terms:

$$\log\left(\frac{P(Y_{ij} = 1)}{P(Y_{ij} = 0)} \,\Big|\, X_{ij}\right) = \beta_0 + \beta_1 x_{ij} + \epsilon_{ij}.$$

In the marginal model (GEE), we are ignoring differences among hospital-units and just aim to obtain population (hospital-wide) rates of failure (patient death) and their association with patient gender. The GEE model fit estimates the odds ratio representing the population-averaged (hospital-wide) odds of failure associated with patient gender.

Thus, parameter estimates ($\hat{\beta}$) from GEE and GLMM models may differ because they estimate different things.
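
A small simulation can illustrate the population-averaged versus subject-specific distinction. In this hedged sketch (simulated data, hypothetical names, base R only), responses are generated from a random-intercept logistic model with a subject-specific slope of 1; a marginal logistic fit that ignores the clusters then recovers a population-averaged slope, which is attenuated toward zero.

```r
set.seed(2024)
n_clusters <- 2000; n_per <- 10

# Cluster-specific random intercepts (e.g., hospital units), SD = 2
b <- rep(rnorm(n_clusters, sd = 2), each = n_per)
x <- rbinom(n_clusters * n_per, 1, 0.5)       # binary predictor, e.g., gender

beta1_conditional <- 1.0                      # subject-specific log odds ratio
y <- rbinom(length(x), 1, plogis(-0.5 + beta1_conditional * x + b))

# Marginal fit: no random effects, so the slope is population-averaged
beta1_marginal <- unname(coef(glm(y ~ x, family = binomial))["x"])

c(conditional = beta1_conditional, marginal = beta1_marginal)
```

The marginal slope is noticeably smaller than the conditional value used to generate the data. Neither number is wrong; they answer different questions.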

Let’s  compare  the  results  of  the  GLM  and  GLMM  models  for  our  PPMI  dataset. 


# install.packages("gee")
library(gee)

model.glm1 <- glm(ResearchGroup~UPDRS+Imaging+Age+Sex+Weight, data = mydata, family="binomial")
model.glm1

##
## Call:  glm(formula = ResearchGroup ~ UPDRS + Imaging + Age + Sex + Weight,
##     family = "binomial", data = mydata)
##
## Coefficients:
## (Intercept)        UPDRS      Imaging          Age          Sex
##   -10.64144     -1.96707      0.03889      0.71562      0.19361
##      Weight
##     0.40606
##
## Degrees of Freedom: 1205 Total (i.e. Null);  1200 Residual
##   (558 observations deleted due to missingness)
## Null Deviance:       811.9
## Residual Deviance: 195.8     AIC: 207.8

#mydata1 <- na.omit(mydata)
#attach(mydata1)
#model.gee <- gee(ResearchGroup~L_insular_cortex_ComputeArea+L_insular_cortex_Volume+Sex+Weight+Age+chr17_rs11012_GT+chr17_rs199533_GT+UPDRS_part_I+UPDRS_part_II+time_visit, id=FID_IID, data=mydata1, family=binomial(link=logit))

model.glmm <- glmer(ResearchGroup~UPDRS+Imaging+Age+Sex+Weight+(1|FID_IID), data=mydata, family="binomial")
display(model.glmm)

## glmer(formula = ResearchGroup ~ UPDRS + Imaging + Age + Sex +
##     Weight + (1 | FID_IID), data = mydata, family = "binomial")
##             coef.est coef.se
## (Intercept) -86.63    32.07
## UPDRS       -16.78     6.27
## Imaging       0.59     0.61
## Age           6.04     2.41
## Sex           0.65     2.15
## Weight        6.12     3.76
##
## Error terms:
##  Groups   Name        Std.Dev.
##  FID_IID  (Intercept) 40.72
##  Residual              1.00
## ---
## number of obs: 1206, groups: FID_IID, 440
## AIC = 129.5, DIC = -114.1
## deviance = 0.7


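In terms of AIC (129.5 versus 207.8), the GLMM model is a lot better than the GLM model. Note, however, that the very large random-intercept standard deviation (40.72) and the near-zero deviance (0.7) suggest that this GLMM fit should be interpreted with caution.

Try to apply some of these longitudinal data analytics on the fMRI data we discussed in Chap. 4 (Visualization).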


19.5  Assignment:  19.  Big  Longitudinal  Data  Analysis 
19.5.1  Imaging  Data 

Review  the  3D/4D  MRI  imaging  data  discussion  in  Chap.  4.  Extract  the  time  courses 
of  several  time  series  at  different  3D  spatial  locations,  some  near-by,  and  some 
farther  apart  (distant  voxels).  Then,  apply  time-series  analyses,  report  findings, 
determine  if  near-by  or  farther-apart  voxels  may  be  more  correlated. 

Example  of  extracting  time  series  from  4D  fMRI  data: 


# See examples here: https://cran.r-project.org/web/packages/oro.nifti/vignettes/nifti.pdf
library(oro.nifti)   # provides readNIfTI()

fMRIURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/fMRI_FilteredData_4D.nii.gz"
fMRIFile <- file.path(tempdir(), "fMRI_FilteredData_4D.nii.gz")
download.file(fMRIURL, dest=fMRIFile, quiet=TRUE)

(fMRIVolume <- readNIfTI(fMRIFile, reorient=FALSE))

# dimensions: 64 x 64 x 21 x 180 ; 4mm x 4mm x 6mm x 3 sec

fMRIVolDims <- dim(fMRIVolume); fMRIVolDims
time_dim <- fMRIVolDims[4]; time_dim

hist(fMRIVolume)




# To examine the time course of a specific 3D voxel (say the one at x=30, y=30, z=10):
plot(fMRIVolume[30, 30, 10,], type='l', main="Time Series of 3D Voxel \n (x=30, y=30, z=10)", col="blue")

x1 <- c(1:180)

# loess fit of the voxel time course (stored, not plotted)
y1 <- loess(fMRIVolume[30, 30, 10,]~ x1, family = "gaussian")

# overlay a running-median smoother (red) and a Gaussian kernel smoother (green)
lines(x1, smooth(fMRIVolume[30, 30, 10,]), col = "red", lwd = 2)

lines(ksmooth(x1, fMRIVolume[30, 30, 10,], kernel = "normal", bandwidth = 5), col = "green", lwd = 3)


19.5.2  Time  Series  Analysis 


Use  Google  Web-Search  Trends  and  Stock  Market  Data  to: 

•  Plot  time  series  for  the  variable  Job. 

•  Apply  TTR  to  smooth  the  original  graph  by  month. 

•  Determine  the  differencing  parameter. 

•  Decide  the  auto-regressive  (AR)  and  moving  average  (MA)  parameters. 

•  Build  an  ARIMA  model,  forecast  the  Job  variable  over  the  next  year  and 
evaluate  this  model. 


19.5.3  Latent  Variables  Model 

Use the Handwritten English Letters data to:

•  Explore  the  data  and  evaluate  the  correlations  between  covariates. 

•  Justify  the  application  of  a  latent  variable  model. 

•  Apply  proper  data  conversion  and  scaling. 

• Fit a Structural Equation Model (SEM) using the lavaan::cfa() function for these data by adding appropriate latent variables.

•  Summarize  and  interpret  the  outputs. 

•  Use  the  model  you  found  above  to  fit  GEE  and  GLMM  models  using  the  latent 
variable  as  response  and  compare  the  models  using  AIC.  (Hint:  add  a  fake 
variable  as  random  effect  for  GLMM). 
As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to relational data where the observed information is represented by tables, vectors, arrays, tensors, or data-frames containing binary, categorical, ordinal, or numerical values. Such representations provide incredible advantages (e.g., quick reference and de-reference of elements, search, discovery, and navigation), but also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes a free text feature (e.g., physician/nurse clinical notes, biospecimen samples) that contains information about medical condition, treatment, or outcome. It's very difficult, or sometimes even impossible, to include the raw text in automated data analytics using the classical procedures and statistical models available for relational datasets.

Natural  Language  Processing  (NLP)  and  Text  Mining  (TM)  refer  to  automated 
machine-driven  algorithms  for  semantically  mapping,  extracting  information,  and 
understanding  of  (natural)  human  language.  Sometimes,  this  involves  extracting 
salient  information  from  large  amounts  of  unstructured  text.  To  do  so,  we  need  to 
build  semantic  and  syntactic  mapping  algorithms  for  effective  processing  of  heavy 
text.  Related  to  NLP/TM,  the  work  we  did  in  Chap.  8  showed  a  powerful  text 
classifier  using  the  naive  Bayes  algorithm. 

In this chapter, we will present more details about various text processing strategies in R. Specifically, we will present simulated and real examples of text processing and computing document term frequency (TF), inverse document frequency (IDF), cosine similarity transformation, and machine learning based sentiment analysis.

Live demos will show various NLP tasks directly in the browser, including intermediate stages of processing as well as complete text analyses.


©  Ivo  D.  Dinov  2018 

I.  D.  Dinov,  Data  Science  and  Predictive  Analytics, 
https://doi.org/10.1007/978-3-319-72347-1_20
20.1 A Simple NLP/TM Example

Text  mining  or  text  analytics  (TM/TA)  examines  large  volumes  of  unstructured  text 
(corpus),  aiming  to  extract  new  information,  discover  context,  identify  linguistic 
motifs,  or  transform  the  text  into  a  structured  data  format  leading  to  derived 
quantitative  data  that  can  be  further  analyzed.  Natural  language  processing  (NLP) 
is  one  example  of  a  TM  analytical  technique.  Whereas  TM’s  goal  is  to  discover 
relevant  contextual  information,  which  may  be  unknown,  hidden,  or  obfuscated, 
NLP  is  focused  on  linguistic  analysis  that  trains  a  machine  to  interpret  voluminous 
textual  content.  To  decipher  the  semantics  and  ambiguities  in  human-interpretable 
language, NLP employs automatic summarization, tagging, disambiguation, extraction of entities and relations, pattern recognition, and frequency analyses. As of 2018, the total amount of information generated by the human race exceeds 6 zettabytes (1 ZB = $10^{21} \approx 2^{70}$ bytes), which is projected to top 50 ZB by 2020. The amount of data we obtain, and record, doubles every 12-14 months (Kryder's law). A small fraction of this massive information (<0.0001% or <1 PB = $10^{15}$ bytes) represents newly written or transcribed text, including code. However, it is impossible
(cf.  efficiency,  time,  resources)  for  humans  to  read,  synthesize,  interpret  and  react 
to  all  this  information  without  direct  assistance  of  TM/NLP.  The  information  content 
in  text  could  be  substantially  higher  than  that  of  other  information  media.  Remember 
that  “a  picture  may  be  worth  a  thousand  words”,  yet,  “a  word  may  also  be  worth  a 
thousand  pictures”.  As  an  example,  the  simple  sentence  “ The  data  science  and 
predictive  analytics  textbook  includes  23  Chapters ”  takes  63  bytes  to  store  as  text; 
however,  a  color  image  showing  this  as  printed  text  could  reach  10  megabytes  (MB), 
and  an  HD  video  of  a  speaker  reading  the  same  sentence  could  easily  surpass  50  MB. 
Text  mining  and  natural  language  processing  may  be  used  to  automatically  analyze 
and  interpret  written,  coded,  or  transcribed  content  to  assess  news,  moods,  emotions, 
clinical  notes,  and  biosocial  trends  related  to  specific  topics. 

In  general,  text  analysis  protocols  involve: 

• Construction of a document-term matrix (DTM) from the input documents, vectorizing the text, e.g., creating a map of single words or n-grams into a vector space. That is, the vectorizer is a function mapping terms to indices.

• Application of a model-based statistical analysis or a model-free machine learning technique for prediction, clustering, classification, similarity search, network/sentiment analysis, or forecasting using the DTM. This step also includes tuning and internally validating the performance of the method.

•  Application  and  evaluation  of  the  technique  to  new  data. 
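
The first step of this protocol, building a document-term matrix, can be sketched in a few lines of base R before turning to dedicated packages. The three short documents below are made-up stand-ins for real syllabi, and the object names (docs, tokens, vocab, dtm) are illustrative only.

```r
# Three tiny hypothetical documents
docs <- c(d1 = "data science and predictive analytics",
          d2 = "predictive big data analytics using R",
          d3 = "scientific methods for health sciences")

# Tokenize: lower-case, split on non-letters, drop empty tokens
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(w) w[nzchar(w)])

# The vectorizer maps each term to a column index of the vocabulary
vocab <- sort(unique(unlist(tokens)))

# Count term occurrences per document: rows = documents, columns = terms
counts <- vapply(tokens,
                 function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(names(docs), vocab)

dtm
colSums(dtm)   # corpus-wide term frequencies
```

Real corpora usually also need stop-word removal, stemming, and n-gram handling, which this sketch deliberately omits.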

We  are  going  to  demonstrate  this  protocol  with  a  very  simple  example.  Figure  20.1 
points  to  a  separate  online  demo. 


Fig. 20.1 Live demo: Dynamic NLP demonstration


20.1.1  Define  and  Load  the  Unstructured-Text  Documents 

Let’s  create  some  documents  we  can  use  to  illustrate  the  use  of  the  tm  package  for 
text  mining.  The  five  documents  below  represent  portions  of  the  syllabi  of  five  recent 
courses  taught  by  the  author: 

•  HS650:  Data  Science  and  Predictive  Analytics  (DSPA) 

•  Bootcamp:  Predictive  Big  Data  Analytics  using  R 

•  HS  853:  Scientific  Methods  for  Health  Sciences:  Special  Topics 

•  HS851:  Scientific  Methods  for  Health  Sciences:  Applied  Inference,  and 

•  HS550:  Scientific  Methods  for  Health  Sciences:  Fundamentals 

We import the syllabi into several separate segments represented as documents.

• As an exercise, try to use the rvest::read_html method to load in the five course syllabi directly from the course websites listed above.

doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multisource, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high throughput analytic workflows will be emphasized throughout the course."

doc2 <- "Bootcamp: A week-long intensive Bootcamp focused on methods, techniques, tools, services and resources for big healthcare and biomedical data analytics using the open-source statistical computing software R. Morning sessions (3 hrs) will be dedicated to methods and technologies and applications. Afternoon sessions (3 hrs) will be for group-based hands-on practice and team work. Commitment to attend the full week of instruction (morning sessions) and self-guided work (afternoon sessions) is required. Certificates of completion will be issued only to trainees with perfect attendance that complete all work. This hands-on intensive graduate course (Bootcamp) will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."

doc3 <- "HS 853: This course covers a number of modern analytical methods for
advanced healthcare research. Specific focus will be on reviewing and using
innovative modeling, computational, analytic and visualization techniques to
address concrete driving biomedical and healthcare applications. The course
will cover the 5 dimensions of Big-Data (volume, complexity, multiple scales,
multiple sources, and incompleteness). HS853 is a 4 credit hour course
(3 lectures + 1 lab/discussion). Students will learn how to conduct research,
employ and report on recent advanced health sciences analytical methods; read,
comprehend and present recent reports of innovative scientific methods;
apply a broad range of health problems; experiment with real Big-Data. Topics
Covered include: Foundations of R, Scientific Visualization, Review of Multivariate
and Mixed Linear Models, Causality/Causal Inference and Structural
Equation Models, Generalized Estimating Equations, PCOR/CER methods Heterogeneity
of Treatment Effects, Big-Data, Big-Science, Internal statistical
cross-validation, Missing data, Genotype-Environment-Phenotype, associations,
Variable selection (regularized regression and controlled/knockoff filtering),
medical imaging, Databases/registries, Meta-analyses, classification
methods, Longitudinal data and time-series analysis, Geographic Information
Systems (GIS), Psychometrics and Rasch measurement model analysis, Bayesian
inference, and Network Analysis."

doc4 <- "HS 851: This course introduces students to applied inference methods
in studies involving multiple variables. Specific methods that will be discussed
include linear regression, analysis of variance, and different regression
models. This course will emphasize the scientific formulation, analytical
modeling, computational tools and applied statistical inference in diverse
health-sciences problems. Data interrogation, modeling approaches, rigorous
interpretation and inference will be emphasized throughout. HS851 is a 4
credit hour course (3 lectures + 1 lab/discussion). Students will learn how
to: Understand the commonly used statistical methods of published scientific
papers, Conduct statistical calculations/analyses on available data,
Use software tools to analyze specific case-studies data, Communicate advanced
statistical concepts/techniques, Determine, explain and interpret assumptions
and limitations. Topics Covered include Epidemiology, Correlation/SLR,
and slope inference, 1-2 samples, ROC Curve, ANOVA, Non-parametric
inference, Cronbach's $\alpha$, Measurement Reliability/Validity, Survival
Analysis, Decision theory, CLT/LLNs - limiting results and misconceptions,
Association Tests, Bayesian Inference, PCA/ICA/Factor Analysis, Point/
Interval Estimation (CI) - MoM, MLE, Instrument performance Evaluation,
Study/Research Critiques, Common mistakes and misconceptions in using
probability and statistics, identifying potential assumption violations, and
avoiding them."

doc5 <- "HS550: This course provides students with an introduction to probability
reasoning and statistical inference. Students will learn theoretical
concepts and apply analytic skills for collecting, managing, modeling,
processing, interpreting and visualizing (mostly univariate) data. Students
will learn the basic probability modeling and statistical analysis methods and
acquire knowledge to read recently published health research publications.
HS550 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will
learn how to: Apply data management strategies to sample data files, Carry out
statistical tests to answer common healthcare research questions using appropriate
methods and software tools, Understand the core analytical data modeling
techniques and their appropriate use. Examples of Topics Covered:
EDA/Charts, Ubiquitous variation, Parametric inference, Probability Theory,
Odds Ratio/Relative Risk, Distributions, Exploratory data analysis,
Resampling/Simulation, Design of Experiments, Intro to Epidemiology, Estimation,
Hypothesis testing, Experiments vs. Observational studies, Data management
(tables, streams, cloud, warehouses, DBs, arrays, binary, ASCII, handling,
mechanics), Power, sample-size, effect-size, sensitivity, specificity,
Bias/Precision, Association vs. Causality, Rate-of-change, Clinical vs. Stat
significance, Statistical Independence Bayesian Rule."


20.1.2  Create  a  New  VCorpus  Object 

The  VCorpus  object  includes  all  the  text  and  some  meta-data  (e.g.,  indexing)  about 
the  entire  text. 

docs <- c(doc1, doc2, doc3, doc4, doc5)

class(docs) 

##  [1]  "character" 

We can make a VCorpus object using the tm package. To complete this task, we
need to know the source type. Here, docs is a vector of class "character", so we
should use VectorSource(). If it were a data frame, we would use
DataframeSource() instead. VCorpus() creates a volatile corpus, which is
the data type used by the tm package for text mining.

library(tm)

doc_corpus <- VCorpus(VectorSource(docs))
doc_corpus

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5

doc_corpus[[1]]$content

## [1] "HS650: The Data Science and Predictive Analytics (DSPA) course (offe
red as a massive open online course, MOOC, as well as a traditional Universi
ty of Michigan class) aims to build computational abilities, inferential thi
nking, ... throughout the course."

This generates a list containing the information for the five documents we have
created. Now we can apply the tm_map() function on this object to preprocess the text.
The goal is to automatically interpret the text and output more succinct information.


20.1.3  To-Lower  Case  Transformation 

The  text  itself  contains  upper  case  letters  as  well  as  lower  case  letters.  The  first  thing 
to  do  is  to  convert  all  characters  to  lower  case. 

doc_corpus <- tm_map(doc_corpus, tolower)
doc_corpus[[1]]

## [1] "hs650: the data science and predictive analytics (dspa) course (offe
red as a massive open online course, mooc, as well as a traditional universi
ty of michigan class) ... community validation of high-throughput analytic wor
kflows will be emphasized throughout the course."
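
As a hedged aside, recent versions of tm provide content_transformer(), which wraps an ordinary character function so that each document keeps its PlainTextDocument structure after tm_map(). The following is a minimal sketch under that assumption; toy_docs is a made-up stand-in for the syllabi, not part of the example above.

```r
library(tm)

# Hypothetical two-document corpus standing in for the course syllabi
toy_docs <- c("HS650: The Data Science Course", "Bootcamp: A Week-Long Course")
toy_corpus <- VCorpus(VectorSource(toy_docs))

# content_transformer() applies tolower() to the document content only,
# so each element remains a PlainTextDocument
toy_lower <- tm_map(toy_corpus, content_transformer(tolower))

class(toy_lower[[1]])
toy_lower[[1]]$content
```

With this variant, a later conversion back to plain text documents may be unnecessary, although the workflow in this section keeps the direct tolower call.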


20.1.4  Text  Pre-processing 

Remove  Stopwords 

These documents contain a lot of "stopwords", i.e., common words that carry
grammatical meaning but have low analytic value. We can remove these using the
following commands.

stopwords("english")

##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  ...
## [171] "so"         "than"       "too"        "very"

doc_corpus <- tm_map(doc_corpus, removeWords, stopwords("english"))
doc_corpus[[1]]

## [1] "hs650:  data science  predictive analytics (dspa) course (offered
massive open online course, mooc,  well   traditional university  michigan c
lass) aims  build ..., sharing  community validation  high-throughput analytic
workflows will  emphasized throughout  course."

We removed all the stopwords specified in the stopwords("english") list.
You can always make your own stopword list and use
doc_corpus <- tm_map(doc_corpus, removeWords, your_own_words_list) to
apply it.

From the output for doc1 we notice that the removal of stopwords leaves extra blank
spaces. Thus, the next step is to remove them.

doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus[[1]]

## [1] "hs650: data science predictive analytics (dspa) course (offered mass
ive open online course, mooc, well traditional university michigan class) ai
ms build computational abilities, ..., sharing community validation high-throu
ghput analytic workflows will emphasized throughout course."
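
To see why stopword removal leaves gaps that then need to be squeezed out, the two steps can be imitated in base R. This is only an illustrative sketch, not the tm implementation; the helper remove_words(), the sample sentence, and the custom stopword list are all hypothetical.

```r
# Hypothetical helper: delete whole words listed in `words` from a string
remove_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

txt <- "the course will cover the basics of data science"
my_stopwords <- c("the", "will", "of")   # a user-defined stopword list

# Removing words leaves the surrounding blanks behind
cleaned <- remove_words(txt, my_stopwords)
cleaned

# Collapse runs of whitespace and trim the ends
squeezed <- trimws(gsub("\\s+", " ", cleaned))
squeezed
```

The intermediate string contains doubled blanks wherever a word was deleted, which is the same effect visible in the corpus output before stripWhitespace is applied.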


Remove  Punctuation 

Now we notice the irrelevant punctuation in the text, which can be removed by using
a combination of the tm_map() and removePunctuation() functions.

doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus[[2]]

##  [1]  "bootcamp  weeklong  intensive  bootcamp  focused  methods  techniques  tool 
s  services  resources  big  healthcare  biomedical  data  analytics  using  opensour 
ce  statistical  computing  software  r  morning  sessions  3  hrs  ...  collaborative  d 
esign  implementation  sharing  community  validation  highthroughput  analytic  wo 
rkflows  will  emphasized  throughout  course" 

The above tm_map commands changed the structure of our doc_corpus
object. We may apply the PlainTextDocument function if we need to convert it
back to the original format.


doc_corpus <- tm_map(doc_corpus, PlainTextDocument)


Stemming:  Removal  of  Plurals  and  Action  Suffixes 

Let's inspect the first three documents. We notice that there are some words ending
with "ing", "es", or "s".

doc_corpus[[1]]$content

## [1] "hs650 data science predictive analytics dspa course offered massive
open online course mooc well traditional university michigan class aims buil
d computational abilities inferential ... validation highthroughput analytic w
orkflows will emphasized throughout course"

doc_corpus[[2]]$content

## [1] "bootcamp weeklong intensive bootcamp focused methods techniques tool
s services resources big healthcare biomedical data analytics using opensour
ce statistical computing software r morning sessions 3 ... design implementat
ion sharing community validation highthroughput analytic workflows will emph
asized throughout course"

doc_corpus[[3]]$content

## [1] "hs 853 course covers number modern analytical methods advanced healt
hcare research specific focus will reviewing using innovative modeling compu
tational analytic visualization ... information systems gis psychometrics rasc
h measurement model analysis mcmc sampling bayesian inference network analys
is"


If we have multiple terms that only differ in their endings (e.g., past, present,
present-perfect-continuous tense), the algorithm will treat them as different terms because it
does not understand language semantics the way a human would. To make things
easier for the computer, we can delete these endings by "stemming" the documents.
Remember to load the package SnowballC before using the function
stemDocument(). The earliest stemmer was written by Julie Beth Lovins in
1968 and had great influence on all subsequent work. Currently, one of the most
popular stemming approaches is the one proposed by Martin Porter, which is used in
stemDocument(); you can read more about the Porter algorithm online.


# install.packages("SnowballC")
library(SnowballC)

doc_corpus <- tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content

## [1] "hs650 data scienc predict analyt dspa cours offer massiv open onlin
cours mooc well tradit univers michigan class aim build comput abil inferent
i think practic skill tackl core data scientif ... fuse model visual analyz sy
nthes complex dataset collabor design implement share communiti valid highth
roughput analyt workflow will emphas throughout cours"

This  stemming  process  has  to  be  done  after  the  PlainTextDocument  function 
because  stemDocument  can  only  be  applied  to  plain  text. 
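
To get a feel for what stemming does to individual words, the stemmer can be called directly on a character vector. The sketch below assumes the SnowballC package is installed and uses its wordStem() function with the English stemmer; the four sample words are chosen for illustration only.

```r
library(SnowballC)

# A few words of the kind that appear in the syllabi
words <- c("courses", "modeling", "analytics", "statistics")

# Apply the English Snowball (Porter-style) stemmer to each word
stems <- wordStem(words, language = "english")

data.frame(word = words, stem = stems)
```

The stems are not necessarily dictionary words; they are just common prefixes that let different inflections of a term be counted together.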


20.1.5  Bags  of  Words 

It's very useful to be able to tokenize text documents into n-grams, i.e., sequences of
words. For example, a 2-gram represents a two-word phrase whose words appear together in order.
This allows us to form bags of words and extract information about word ordering.
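
As a small illustration of n-gram tokenization, the following base R sketch splits a string on whitespace and pastes together each run of n consecutive tokens. The helper make_ngrams() is hypothetical and deliberately simple; dedicated tokenizer packages handle punctuation and sentence boundaries more carefully.

```r
# Hypothetical helper: build word n-grams from a single string
make_ngrams <- function(text, n = 2) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # paste tokens i, i+1, ..., i+n-1 into one phrase
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("data science and predictive analytics", n = 2)
bigrams
```

Each 2-gram preserves the local word order, which a plain bag of single words would discard.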

The bag of words model is a common way to represent documents in matrix form
based on their term frequencies (TFs). We can construct an n x t document-term
matrix (DTM), where n is the number of documents, and t is the number of unique
terms. Each column in the DTM represents a unique term. For instance, the (i, j)th
cell represents how many times term j is present in document i.

The  basic  bag  of  words  model  is  invariant  to  ordering  of  the  words  within  a 
document.  Once  we  compute  the  DTM,  we  can  use  machine  learning  techniques  to 
interpret  the  derived  signature  information  contained  in  the  resulting  matrices. 
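
The construction of a document-term matrix can be sketched in a few lines of base R. The three toy documents below are invented for illustration; the tm functions used in the next section produce the same kind of object in a sparse format.

```r
# Three made-up, already-cleaned documents
toy_docs <- c(d1 = "data science data analytics",
              d2 = "statistics course",
              d3 = "data course course")

# Tokenize on whitespace and collect the vocabulary (the t unique terms)
tokens <- strsplit(toy_docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term within each document: rows = documents, columns = terms
dtm_toy <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
colnames(dtm_toy) <- vocab
dtm_toy
```

Shuffling the words within any document leaves its row unchanged, which is the order-invariance property noted above.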


20.1.6  Document  Term  Matrix 

Now that the doc_corpus object is quite clean, we can make a document-term
matrix to explore all the terms in the five initial documents. The entries of this
matrix record how many times a specific term appears in a specific
document. Note that TermDocumentMatrix() stores the terms as rows and the documents
as columns, i.e., it is the transpose of the n x t DTM described above.

doc_dtm <- TermDocumentMatrix(doc_corpus)
doc_dtm

## <<TermDocumentMatrix (terms: 329, documents: 5)>>
## Non-/sparse entries: 540/1105
## Sparsity           : 67%
## Maximal term length: 27
## Weighting          : term frequency (tf)

The summary of the term-document matrix is informative. We have 329 different
terms in the five documents. There are 540 non-zero and 1,105 sparse (zero) entries. Thus,
the sparsity is 1105/(540 + 1105) ≈ 67%, which measures the term sparsity across all
documents. A high sparsity means that the terms are not repeated often among different
documents.
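
The sparsity figure can be checked with simple arithmetic. The sketch below only uses the three numbers reported in the matrix summary (329 terms, 5 documents, 540 non-zero entries); the variable names are illustrative.

```r
n_terms   <- 329   # rows of the term-document matrix
n_docs    <- 5     # columns of the term-document matrix
n_nonzero <- 540   # non-sparse entries reported in the summary

n_cells <- n_terms * n_docs          # total number of matrix cells
n_zero  <- n_cells - n_nonzero       # sparse (zero) entries
sparsity <- n_zero / n_cells

n_zero
round(100 * sparsity)                # sparsity as a percentage
```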

Recall that we applied the PlainTextDocument function to our doc_corpus
object. This removed all document meta data. To relabel the documents in the
term-document matrix, we can use the following commands:

doc_dtm$dimnames$Docs <- as.character(1:5)
inspect(doc_dtm)

## <<TermDocumentMatrix (terms: 329, documents: 5)>>
## Non-/sparse entries: 540/1105
## Sparsity           : 67%
## Maximal term length: 27
## Weighting          : term frequency (tf)
## Sample             :
##          Docs
## Terms     1 2 3 4 5
##   analyt  3 3 3 1 2
##   cours   4 2 3 3 2
##   data    7 5 2 3 6
##   infer   0 0 2 6 2
##   method  0 2 5 3 2
##   model   3 2 4 3 3
##   statist 2 1 1 5 4
##   student 2 2 1 2 4
##   use     2 2 1 3 2
##   will    6 8 3 4 3

We might want to find and report the frequent terms using this term-document
matrix.

findFreqTerms(doc_dtm, lowfreq=2)

##   [1] "abil"           "action"         "address"        "advanc"
##   [5] "afternoon"      "aggreg"         "analysi"        "analyt"
##   [9] "analyz"         "appli"          "applic"         "appropri"
##  [13] "associ"         "assumpt"        "attend"         "bayesian"
##  [17] "big"            "bigdata"        "biomed"         "bootcamp"
##  [21] "challeng"       "clean"          "collabor"       "common"
##  [25] "communiti"      "complet"        "complex"        "comput"
##  [29] "concept"        "conduct"        "context"        "core"
##  [33] "cours"          "cover"          "credit"         "data"
##  [37] "dataset"        "deriv"          "design"         "develop"
##  [41] "drive"          "emphas"         "enorm"          "epidemiolog"
##  [45] "equat"          "estim"          "exampl"         "experi"
##  [49] "export"         "expos"          "extract"        "focus"
##  [53] "forecast"       "foundat"        "fuse"           "general"
##  [57] "graduat"        "hand"           "handl"          "harmon"
##  [61] "health"         "healthcar"      "heterogen"      "highthroughput"
##  [65] "hour"           "hrs"            "hs550"          "implement"
##  [69] "import"         "includ"         "incomplet"      "incongru"
##  [73] "infer"          "inform"         "innov"          "intens"
##  [77] "interpret"      "interrog"       "knowledg"       "labdiscuss"
##  [81] "larg"           "learn"          "lectur"         "limit"
##  [85] "linear"         "manag"          "measur"         "method"
##  [89] "misconcept"     "model"          "modern"         "morn"
##  [93] "multipl"        "multisourc"     "observ"         "open"
##  [97] "opportun"       "overview"       "power"          "practic"
## [101] "preprocess"     "present"        "principl"       "probabl"
## [105] "problem"        "process"        "program"        "provid"
## [109] "publish"        "read"           "real"           "recent"
## [113] "regress"        "relat"          "report"         "research"
## [117] "review"         "sampl"          "scienc"         "scientif"
## [121] "servic"         "session"        "share"          "skill"
## [125] "social"         "softwar"        "specif"         "statist"
## [129] "student"        "studi"          "synthes"        "techniqu"
## [133] "test"           "theori"         "throughout"     "tool"
## [137] "topic"          "understand"     "use"            "valid"
## [141] "variabl"        "visual"         "will"           "work"
## [145] "workflow"

This gives us the terms that occur at least twice across the five documents. High-frequency
terms like comput, statist, model, healthcar, and learn make perfect sense
to be included, as these syllabi describe courses that cover modeling, statistical and
computational methods with applications to health sciences.
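
When reading such output, it helps to distinguish a term's total count over the corpus from the number of documents that contain it. As far as we understand the tm documentation, the lowfreq argument thresholds the total count. The base R sketch below contrasts the two notions on a made-up 3 x 3 term-document matrix.

```r
# Hypothetical term-document counts (rows = terms, columns = documents)
tdm_toy <- matrix(c(2, 0, 0,
                    1, 1, 0,
                    1, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("data", "model", "rare"),
                                  Docs  = c("1", "2", "3")))

total_freq <- rowSums(tdm_toy)       # occurrences summed over all documents
doc_freq   <- rowSums(tdm_toy > 0)   # number of documents containing the term

freq_by_total <- names(total_freq)[total_freq >= 2]
freq_by_docs  <- names(doc_freq)[doc_freq >= 2]

freq_by_total
freq_by_docs
```

Here "data" passes the total-count threshold even though it occurs in a single document, whereas only "model" appears in two different documents.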

The tm package also provides the correlation between terms. Here is a mechanism
to determine the words that are highly correlated with statist, (ρ(statist, ?) ≥ 0.8).

findAssocs(doc_dtm, "statist", corlimit = 0.8)

## $statist
## epidemiolog     publish       studi      theori  understand       appli
##        0.95        0.95        0.95        0.95        0.95        0.83
##        test
##        0.80
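
These associations can be reproduced by hand for any pair of terms whose counts are visible. Assuming the reported values are ordinary Pearson correlations between the per-document count vectors, the sketch below uses the counts for statist and infer from the inspect() output shown earlier.

```r
# Counts across documents 1-5, copied from the inspect(doc_dtm) sample above
statist <- c(2, 1, 1, 5, 4)
infer   <- c(0, 0, 2, 6, 2)

# Pearson correlation between the two term-count vectors
r <- cor(statist, infer)
round(r, 2)

# Would this pair pass a 0.8 correlation limit?
r >= 0.8
```

The correlation falls slightly below the 0.8 limit, which is consistent with infer being absent from the list of associated terms.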


20.2 Case-Study: Job Ranking

Let's explore some real datasets. First, we will import the 2011 USA Jobs Ranking
Dataset from the SOCR data archive.


library(rvest)

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")

html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top


job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)

##   Index                Job_Title Overall_Score Average_Income(USD)
## 1     1        Software_Engineer            60               87140
## 2     2            Mathematician            73               94178
## 3     3                  Actuary           123               87204
## 4     4             Statistician           129               73208
## 5     5 Computer_Systems_Analyst           147               77153
## 6     6            Meteorologist           175               85210
##   Work_Environment Stress_Level Stress_Category Physical_Demand
## 1           150.00        10.40               1            5.00
## 2            89.72        12.78               1            3.97
## 3           179.44        16.04               1            3.97
## 4            89.52        14.08               1            3.95
## 5            90.78        16.53               1            5.08
## 6           179.64        15.10               1            6.98
##   Hiring_Potential
## 1            27.40
## 2            19.78
## 3            17.04
## 4            11.08
## 5            15.53
## 6            12.10
##   Description
## 1 Researches_designs_develops_and_maintains_software_systems_along_with_hardware_development_for_medical_scientific_and_industrial_purposes
## 2 Applies_mathematical_theories_and_formulas_to_teach_or_solve_problems_in_a_business_educational_or_industrial_climate
## 3 Interprets_statistics_to_determine_probabilities_of_accidents_sickness_and_death_and_loss_of_property_from_theft_and_natural_disasters
## 4 Tabulates_analyzes_and_interprets_the_numeric_results_of_experiments_and_surveys
## 5 Plans_and_develops_computer_systems_for_businesses_and_scientific_institutions
## 6 Studies_the_physical_characteristics_motions_and_processes_of_the_earth's_atmosphere


Note that low indices represent jobs that in 2011 were highly desirable. Thus, in
2011, the most desirable job among the top 200 common jobs would be Software
Engineer. The aim of our case study is to explore the difference between the top
30 desirable jobs and the last 100 jobs in the list.

We will go through the same procedure as we did for the simple course syllabi
example. The documents we will be using are the entries of the Description column
(a vector) in the dataset.


20.2.1  Step  1:  Make  a  VCorpus  Object 

jobCorpus <- VCorpus(VectorSource(job[, 10]))


20.2.2  Step  2:  Clean  the  VCorpus  Object 


jobCorpus <- tm_map(jobCorpus, tolower)
for(j in seq(jobCorpus)){
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}


Here we used a loop to substitute each "_" with a blank space. This is because when we
use removePunctuation, all the underscore characters will disappear and there
will be no separation between terms. In this situation, gsub is a convenient
choice.
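
The effect is easy to verify on a single description. The base R sketch below mimics punctuation removal with the POSIX class [[:punct:]], which is an assumption about how punctuation is matched rather than a copy of the tm code, and compares it with the underscore substitution.

```r
desc <- "Plans_and_develops_computer_systems"

# Stripping punctuation directly fuses all the words together
no_punct <- gsub("[[:punct:]]", "", desc)
no_punct

# Replacing underscores with blanks first keeps the terms separated
spaced <- gsub("_", " ", desc)
spaced
```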

jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, PlainTextDocument)
jobCorpus <- tm_map(jobCorpus, stemDocument)


20.2.3  Step  3:  Build  the  Document  Term  Matrix 

The Document Term Matrix (DTM) object (tm::DocumentTermMatrix) contains
a sparse document-term matrix (the transpose of a term-document matrix) and the attribute
weights of the matrix.

First, make sure that we got a clean VCorpus object.
jobCorpus[[1]]$content

##  [1]  "research  design  develop  maintain  softwar  system  along  hardwar  develo 
p  medic  scientif  industri  purpos" 

Then, we can start to build the DTM and reassign the labels to the Docs.

dtm <- DocumentTermMatrix(jobCorpus)
dtm

## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

dtm$dimnames$Docs <- as.character(1:200)
inspect(dtm[1:10, 1:10])

## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 2/98
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##   1        0      0      0     0      0       0     0      0   0     0
##   10       0      0      0     0      0       0     0      0   0     0
##   2        0      0      0     0      0       0     0      0   0     0
##   3        0      0      0     1      0       0     0      0   0     0
##   4        0      0      0     0      0       0     0      0   0     0
##   5        0      0      0     0      0       0     0      0   0     0
##   6        0      0      0     0      0       0     0      0   0     0
##   7        0      0      0     0      0       0     0      0   0     0
##   8        0      0      0     0      1       0     0      0   0     0
##   9        0      0      0     0      0       0     0      0   0     0


Let's subset the dtm into the top 30 jobs and the bottom 100 jobs.

dtm_top30 <- dtm[1:30, ]
dtm_bot100 <- dtm[101:200, ]
dtm_top30

## <<DocumentTermMatrix (documents: 30, terms: 846)>>
## Non-/sparse entries: 293/25087
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

dtm_bot100

## <<DocumentTermMatrix (documents: 100, terms: 846)>>
## Non-/sparse entries: 870/83730
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)


In  this  case,  since  the  sparsity  is  very  high,  we  can  try  to  remove  some  words  that 
rarely  appear  in  the  job  descriptions. 

dtms_top30 <- removeSparseTerms(dtm_top30, 0.90)
dtms_top30

## <<DocumentTermMatrix (documents: 30, terms: 19)>>
## Non-/sparse entries: 70/500
## Sparsity           : 88%
## Maximal term length: 10
## Weighting          : term frequency (tf)

dtms_bot100 <- removeSparseTerms(dtm_bot100, 0.94)
dtms_bot100

## <<DocumentTermMatrix (documents: 100, terms: 14)>>
## Non-/sparse entries: 122/1278
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency (tf)

Now,  instead  of  846  terms,  we  only  have  19  that  appear  in  the  top  30  job 
descriptions  (JDs)  and  14  that  appear  in  the  bottom  100  JDs. 
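
The idea behind this filtering can be sketched in base R: compute, for each term, the share of documents in which it is absent, and keep only the terms whose share is below the chosen threshold. The toy matrix and the 0.75 threshold are made up, and the exact boundary handling inside tm may differ from this simplified rule.

```r
# Hypothetical document-term counts: 10 documents (rows), 3 terms (columns)
dtm_toy <- cbind(common = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
                 medium = c(1, 0, 1, 0, 1, 0, 0, 0, 0, 0),
                 rare   = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1))

# Per-term sparsity: fraction of documents with a zero count
term_sparsity <- colMeans(dtm_toy == 0)
term_sparsity

# Keep the terms that are absent from fewer than 75% of the documents
keep <- term_sparsity < 0.75
dtm_reduced <- dtm_toy[, keep, drop = FALSE]
colnames(dtm_reduced)
```

Lowering the threshold keeps fewer, more widely shared terms, which is why the two subsets above end up with only 19 and 14 terms.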

Similar to what we did in Chap. 8, visualization of the terms as word clouds may
be accomplished by combining the tm and wordcloud packages. First, we can
count the term frequencies in the two document term matrices (Fig. 20.2).


Fig. 20.2 Frequency plot of commonly occurring terms (bottom 100 jobs)

# Calculate the cumulative frequencies of words across documents and sort:

freq1 <- sort(colSums(as.matrix(dtms_top30)), decreasing=T)

freq1


##    develop     assist      natur      studi     analyz    concern
##          6          5          5          5          4          4
##   individu   industri     physic       plan       busi     inform
##          4          4          4          4          3          3
##   institut    problem   research   scientif     theori  treatment
##          3          3          3          3          3          3
## understand
##          3

freq2 <- sort(colSums(as.matrix(dtms_bot100)), decreasing=T)
freq2

##       oper     repair    perform     instal      build     prepar
##         17         15         11          9          8          8
##       busi   commerci  construct   industri     machin manufactur
##          7          7          7          7          7          7
##    product  transport
##          7          7

# Plot

wf = data.frame(term=names(freq2), occurrences=freq2)

library(ggplot2)

##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
##     annotate

p <- ggplot(subset(wf, freq2>2), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

Then, we apply the wordcloud function to the freq dataset (Figs. 20.3 and 20.4).


Fig. 20.3 Wordle plot of the frequently occurring terms in the top-30 jobs

Fig. 20.4 The appearance of the wordle plot may be customized as shown here
for the bottom-100 jobs


library(wordcloud)
set.seed(123)

wordcloud(names(freq1), freq1)

# Color code the frequencies using an appropriate color map:
# Sequential palettes names include:
# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd
# Diverging palettes include
# BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral

wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(6, "Spectral"))

It is apparent that the top 30 jobs focus more on research or discovery of new
things, and include frequent keywords like "study", "nature", and "analyze". The
bottom 100 jobs are more focused on operating on existing objects, with frequent
keywords like "operation", "repair", and "perform".


20.2.4  Area  Under  the  ROC  Curve 

In Chap. 14, we talked about the ROC curve. We can use document term matrices to
build classifiers and use the area under the ROC curve (AUC) to evaluate those
classifiers. Assume that we want to predict whether a job ranks in the top 30, i.e., the
most desired jobs. The first task would be to create an indicator of high rank jobs (top
30). We can use the ifelse() function that we are already familiar with.


# Note: Index < 30 flags ranks 1-29; use Index <= 30 to also include the 30th-ranked job
job$highrank <- ifelse(job$Index < 30, 1, 0)
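
A quick check on the indicator's boundary is worthwhile, since a strict and a non-strict inequality label different numbers of jobs. The sketch assumes, as the dataset preview suggests, that Index simply runs from 1 to 200.

```r
# Assumed ranking indices for the 200 jobs
Index <- 1:200

n_strict    <- sum(ifelse(Index < 30, 1, 0))    # ranks 1, ..., 29
n_inclusive <- sum(ifelse(Index <= 30, 1, 0))   # ranks 1, ..., 30

c(strict = n_strict, inclusive = n_inclusive)
```

Whichever rule is used, the positive class is small relative to the 200 jobs, which is worth keeping in mind when interpreting the cross-validated AUC.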


Fig. 20.5 The area under the curve (AUC) measures the performance of the cross-validated
LASSO-regularized model of job-ranking against the magnitude of the regularization parameter
(bottom axis), and the efficacy of the model selection, i.e., number of non-trivial coefficients (top
axis). The vertical dashed lines suggest an optimal range for the penalty term and the number of
coefficients, see Chap. 18


Next  we  load  the  glmnet  package  to  help  us  build  the  model  and  draw  the 
corresponding  graphs. 

# install.packages("glmnet")

library(glmnet)

The function we will be using is cv.glmnet(), where cv stands for cross-validation.
Since the highrank variable is binary, we specify the option family = 'binomial'.
Also, we want to use the 10-fold CV method for re-sampling (Fig. 20.5).

set.seed(25)

fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']],
                 family = 'binomial',
                 # lasso penalty
                 alpha = 1,
                 # interested in the area under ROC curve
                 type.measure = "auc",
                 # 10-fold cross-validation
                 nfolds = 10,
                 # high value is less accurate, but has faster training
                 thresh = 1e-3,
                 # again lower number of iterations for faster training
                 maxit = 1e3)

plot(fit)

print(paste("max AUC =", round(max(fit$cvm), 4)))

## [1] "max AUC = 0.7276"

Here, x is a matrix and y is the response variable. The last line of code reports
the highest cross-validated AUC across all candidate models (penalty values). The resulting
AUC ≈ 0.73 represents a relatively good prediction model for this small sample size.
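To make the AUC itself more concrete, the following is a minimal base-R sketch (not the computation performed inside cv.glmnet) that obtains the AUC from the rank-sum (Mann-Whitney) identity. The function name auc_rank and the toy scores and labels are hypothetical and are not part of the job data.

```r
# Minimal sketch: AUC via the rank-sum (Mann-Whitney) identity, base R only.
# auc_rank, toy_score and toy_label are illustrative names, not from the text.
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  r  <- rank(score)            # average ranks are used for tied scores
  n1 <- sum(label == 1)        # number of positive cases
  n0 <- sum(label == 0)        # number of negative cases
  # share of (positive, negative) pairs in which the positive case scores higher
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

toy_score <- c(0.10, 0.40, 0.35, 0.80)   # predicted probabilities
toy_label <- c(0, 0, 1, 1)               # true binary labels
toy_auc   <- auc_rank(toy_score, toy_label)
print(toy_auc)
```

In this toy case, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ordering and 1 to perfect separation.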


20.3  TF-IDF 

To  enhance  the  performance  of  the  DTM  matrix,  we  introduce  TF-IDF  (term 
frequency  -  inverse  document  frequency).  Unlike  pure  frequency,  TF-IDF  mea¬ 
sures  the  relative  importance  of  a  term.  If  a  term  appears  in  almost  every  document, 
the  term  will  be  considered  common  with  a  small  weight.  Alternatively,  the  rare 
terms  would  be  considered  more  informational. 


20.3.1  Term  Frequency  (TF) 


TF is the ratio of a term's occurrences in a document to the number of occurrences
of the most frequent word within the same document. Symbolically,

$$TF(t, d) = \frac{f_d(t)}{\max_{w \in d} f_d(w)}.$$


20.3.2  Inverse  Document  Frequency  (IDF) 

The  TF  definition  may  allow  high  scores  for  irrelevant  words  that  naturally  show  up 
often  in  a  long  text,  even  after  triaging  common  words  in  a  prior  preprocessing  step. 
The  IDF  attempts  to  rectify  that.  IDF  represents  the  inverse  of  the  share  of  the 
documents  in  which  the  regarded  term  can  be  found.  The  lower  the  number  of 
documents  containing  the  term,  relative  to  the  size  of  the  corpus,  the  higher  the  term 
factor. 

IDF involves a logarithm function to temper the scoring penalty for
showing up in two documents, which otherwise may be too extreme. Without the logarithm, the
IDF for a term found in just one document would be twice the IDF for another term found in
two documents. The ln() function rectifies this ranking bias in favor of rare terms, even if
the TF-factor may be high. It is rather unlikely that a term’s relevance is high in only
one document and not in any of the others.


$$IDF(t, D) = \ln\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right).$$
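A small numeric sketch may help to see the tempering effect of the logarithm. The corpus size of 200 matches the number of job descriptions used in this chapter. The helper names idf_raw and idf_log and the document counts are hypothetical.

```r
# Minimal sketch: inverse document share with and without the logarithm.
# idf_raw and idf_log are illustrative helper names, not library functions.
idf_raw <- function(n_docs, df) n_docs / df        # plain inverse share of documents
idf_log <- function(n_docs, df) log(n_docs / df)   # IDF(t, D) with the natural log

n_docs <- 200                  # corpus size |D|
df     <- c(1, 2, 100, 200)    # number of documents containing the term

print(data.frame(df, raw = idf_raw(n_docs, df), logged = idf_log(n_docs, df)))

# weight of a one-document term relative to a two-document term
ratio_raw <- idf_raw(n_docs, 1) / idf_raw(n_docs, 2)
ratio_log <- idf_log(n_docs, 1) / idf_log(n_docs, 2)
print(c(ratio_raw = ratio_raw, ratio_log = ratio_log))
```

Without the logarithm, the one-document term receives exactly twice the weight of the two-document term. With the logarithm, the ratio drops to roughly 1.15, and a term present in every document receives weight zero.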


20.3.3  TF-IDF 


Both TF and IDF yield high scores for highly relevant terms. TF relies on local
information (search over d), whereas IDF incorporates a more global perspective
(search over D). The product TF × IDF gives the classical TF-IDF formula.
However, alternative univariate scores may be formulated by applying different
weights to TF and IDF.

$$TF\_IDF(t, d, D) = TF(t, d) \times IDF(t, D).$$

An  example  of  an  alternative  TF-IDF  metric  can  be  defined  by: 

$$TF\_IDF'(t, d, D) = \frac{IDF(t, D)}{|D|} + TF\_IDF(t, d, D).$$
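As a worked illustration of the classical product formula, the sketch below computes TF, IDF, and TF-IDF on a small, hypothetical count matrix (three documents, four terms) using base R only. It follows the definitions given in this section. As far as we are aware, tm's weightTfIdf() uses a base-2 logarithm and, when normalizing, divides term counts by the document length, so its values would differ from those computed here.

```r
# Minimal sketch: TF, IDF and TF-IDF following the formulas in this section.
# The count matrix is a made-up toy example, not the job-description DTM.
counts <- matrix(c(2, 1, 0, 0,
                   0, 1, 3, 0,
                   1, 1, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("audit", "work", "repair", "legal")))

# TF(t, d): counts divided by the count of the most frequent term in each document
tf <- counts / apply(counts, 1, max)

# IDF(t, D): natural log of (number of documents / documents containing the term)
df  <- colSums(counts > 0)
idf <- log(nrow(counts) / df)

# TF-IDF: multiply every column (term) of TF by the IDF of that term
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

The term "work" appears in all three documents, so its IDF, and therefore its TF-IDF weight, is zero everywhere, whereas the rare terms "repair" and "legal" receive the largest weights.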

Let’s  make  another  DTM  with  TF-IDF  weights  and  compare  the  differences  between 
the  unweighted  and  weighted  DTM. 


dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting = weightTfIdf))
dtm.tfidf

## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10])

## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access     accid    accord account accur achiev act activ
##    1       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    2       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    3       0      0      0 0.5536547 0.0000000       0     0      0   0     0
##    4       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    5       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    6       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    7       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    8       0      0      0 0.0000000 0.4321928       0     0      0   0     0
##    9       0      0      0 0.0000000 0.0000000       0     0      0   0     0


inspect(dtm[1:9, 1:10])

## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##    1       0      0      0     0      0       0     0      0   0     0
##    2       0      0      0     0      0       0     0      0   0     0
##    3       0      0      0     1      0       0     0      0   0     0
##    4       0      0      0     0      0       0     0      0   0     0
##    5       0      0      0     0      0       0     0      0   0     0
##    6       0      0      0     0      0       0     0      0   0     0
##    7       0      0      0     0      0       0     0      0   0     0
##    8       0      0      0     0      1       0     0      0   0     0
##    9       0      0      0     0      0       0     0      0   0     0


An inspection of the two different DTMs suggests that TF-IDF is not only
counting the frequency but also assigning different weights to each term according
to the importance of the term. Next, we are going to fit another model with this new
DTM (dtm.tfidf) (Fig. 20.6).


set.seed(2)
fit1 <- cv.glmnet(x = as.matrix(dtm.tfidf), y = job[['highrank']],
                  family = 'binomial',
                  # lasso penalty
                  alpha = 1,
                  # interested in the area under ROC curve
                  type.measure = "auc",
                  # 10-fold cross-validation
                  nfolds = 10,
                  # high value is less accurate, but has faster training
                  thresh = 1e-3,
                  # again lower number of iterations for faster training
                  maxit = 1e3)

plot(fit1)


Fig. 20.6 AUC-based performance of the cross-validated LASSO-regularized model of
job-ranking based on the new DTM (dtm.tfidf), see Fig. 20.5 and Chap. 18


print(paste("max AUC =", round(max(fit1$cvm), 4)))

##  [1]  "max  AUC  =  0.7125" 

This result is about the same as that of the previous job-ranking prediction classifier
(based on the unweighted DTM). Due to random sampling, each run of the protocol
may generate slightly different results. The idea behind using TF-IDF is that one
would expect to get less biased estimates of word importance. If the documents
include stop words, like “the” or “one”, the unweighted DTM may distort the results, but
TF-IDF may resolve some of these problems.

Next,  we  can  report  a  more  intuitive  representation  of  the  job  ranking  prediction 
reflecting  the  agreement  of  the  binary  (top-30  or  not)  classification  between  the  real 
labels  and  the  predicted  labels.  Notice  that  this  applies  only  to  the  training  data  itself. 

# Binarize the LASSO probability prediction
preffit1 <- predict(fit1, newx = as.matrix(dtm.tfidf), s = "lambda.min", type = "class")
# type = "class" returns the labels "0"/"1" as characters; convert before thresholding
binPredfit1 <- ifelse(as.numeric(preffit1) < 0.5, 0, 1)
table(binPredfit1, job[['highrank']])

## 

##  binPredfitl  0  1 

##  0  171  0 

##  1  0  29 
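The agreement table above can be summarized by a few standard rates. The sketch below re-enters the four reported counts by hand (it does not re-run the model) and computes accuracy, sensitivity, and specificity. The object names are illustrative.

```r
# Minimal sketch: summary rates from the 2x2 agreement table reported above.
# Counts are copied from the printed table; conf_tab is an illustrative name.
conf_tab <- matrix(c(171, 0,      # column "actual 0": predicted 0, predicted 1
                     0,   29),    # column "actual 1": predicted 0, predicted 1
                   nrow = 2,
                   dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(conf_tab)) / sum(conf_tab)        # share of correct labels
sensitivity <- conf_tab["1", "1"] / sum(conf_tab[, "1"])  # true positive rate
specificity <- conf_tab["0", "0"] / sum(conf_tab[, "0"])  # true negative rate

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

All three rates equal 1 on the training data, which is noticeably more optimistic than the cross-validated AUC of about 0.71. This gap is a reminder that agreement measured on the data used for fitting tends to overstate out-of-sample performance.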

Let’s try to predict the job ranking of several new (testing or validation) job
descriptions (JDs). There are many job descriptions provided online that we can
extract text from to predict the job ranking of the corresponding positions. Trying
several alternative job categories, e.g., some high-tech or fin-tech, and some
manufacturing or construction jobs, may provide some intuition about the power of
the jobs-classifier we built. Below, we will compare the JDs for the positions of
accountant, attorney, and machinist.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# Choose the JD for a PUBLIC ACCOUNTANTS 1430 (https://www.bls.gov/ocs/ocsjobde.htm)

xTestAccountant <- "Performs professional auditing work in a public accounting
firm. Work requires at least a bachelor's degree in accounting. Participates
in or conducts audits to ascertain the fairness of financial representations
made by client companies. May also assist the client in improving accounting
procedures and operations. Examines financial reports, accounting records, and
related documents and practices of clients. Determines whether all important
matters have been disclosed and whether procedures are consistent and conform
to acceptable practices. Samples and tests transactions, internal controls,
and other elements of the accounting system(s) as needed to render the
accounting firm's final written opinion. As an entry level public accountant,
serves as a junior member of an audit team. Receives classroom and on-the-job
training to provide practical experience in applying the principles, theories,
and concepts of accounting and auditing to specific situations. (Positions
held by trainee public accountants with advanced degrees, such as MBA's are
excluded at this level.) Complete instructions are furnished and work is
reviewed to verify its accuracy, conformance with required procedures and
instructions, and usefulness in facilitating the accountant's professional
growth. Any technical problems not covered by instructions are brought to the
attention of a superior. Carries out basic audit tests and procedures, such
as: verifying reports against source accounts and records; reconciling bank
and other accounts; and examining cash receipts and disbursements, payroll
records, requisitions, receiving reports, and other accounting documents in
detail to ascertain that transactions are properly supported and recorded.
Prepares selected portions of audit working papers"

xTestAttorney <- "Performs consultation, advisory and/or trial work and carries
out the legal processes necessary to effect the rights, privileges, and
obligations of the organization. The work performed requires completion of law
school with an L.L.B. degree or J.D. degree and admission to the bar.
Responsibilities or functions include one or more of the following or comparable duties:
1. Preparing and reviewing various legal instruments and documents, such as
contracts, leases, licenses, purchases, sales, real estate, etc.;
2. Acting as agent of the organization in its transactions;
3. Examining material (e.g., advertisements, publications, etc.) for legal
implications; advising officials of proposed legislation which might affect
the organization;
4. Applying for patents, copyrights, or registration of the organization's
products, processes, devices, and trademarks; advising whether to initiate or
defend law suits;
5. Conducting pretrial preparations; defending the organization in lawsuits;
6. Prosecuting criminal cases for a local or state government or defending
the general public (for example, public defenders and attorneys rendering
legal services to students); or
7. Advising officials on tax matters, government regulations, and/or legal
rights.
Attorney jobs are matched at one of six levels according to two factors:
1. Difficulty level of legal work; and
2. Responsibility level of job.
Attorney jobs which meet the above definitions are to be classified and
coded in accordance with a chart available upon request.
Legal questions are characterized by: facts that are well-established;
clearly applicable legal precedents; and matters not of substantial
importance to the organization. (Usually relatively limited sums of money,
e.g., a few thousand dollars, are involved.)
a. legal investigation, negotiation, and research preparatory to defending
the organization in potential or actual lawsuits involving alleged
negligence where the facts can be firmly established and there are precedent
cases directly applicable to the situation;
b. searching case reports, legal documents, periodicals, textbooks, and other
legal references, and preparing draft opinions on employee compensation or
benefit questions where there is a substantial amount of clearly applicable
statutory, regulatory, and case material;
c. drawing up contracts and other legal documents in connection with real
property transactions requiring the development of detailed information but
not involving serious questions regarding titles to property or other major
factual or legal issues.
d. preparing routine criminal cases for trial when the legal or factual
issues are relatively straight forward and the impact of the case is limited;
and
e. advising public defendants in regard to routine criminal charges or
complaints and representing such defendants in court when legal alternatives and
facts are relatively clear and the impact of the outcome is limited primarily
to the defendant.
Legal work is regularly difficult by reason of one or more of the following:
the absence of clear and directly applicable legal precedents; the different
possible interpretations that can be placed on the facts, the laws, or the
precedents involved; the substantial importance of the legal matters to the
organization (e.g., sums as large as $100,000 are generally directly or
indirectly involved); or the matter is being strongly pressed or contested in
formal proceedings or in negotiations by the individuals, corporations, or
government agencies involved.
a. advising on the legal implications of advertising representations when the
facts supporting the representations and the applicable precedent cases are
subject to different interpretations;
b. reviewing and advising on the implications of new or revised laws
affecting the organization;
c. presenting the organization's defense in court in a negligence lawsuit
which is strongly pressed by counsel for an organized group;
d. providing legal counsel on tax questions complicated by the absence of
precedent decisions that are directly applicable to the organization's
situation;
e. preparing and prosecuting criminal cases when the facts of the cases are
complex or difficult to determine or the outcome will have a significant
impact within the jurisdiction; and
f. advising and representing public defendants in all phases of criminal
proceedings when the facts of the case are complex or difficult to determine,
complex or unsettled legal issues are involved, or the prosecutorial
jurisdiction devotes substantial resources to obtaining a conviction."

xTestMachinist <- "Produces replacement parts and new parts in making repairs
of metal parts of mechanical equipment. Work involves most of the following:
interpreting written instructions and specifications; planning and laying
out of work; using a variety of machinist's handtools and precision measuring
instruments; setting up and operating standard machine tools; shaping of
metal parts to close tolerances; making standard shop computations relating
to dimensions of work, tooling, feeds, and speeds of machining; knowledge of
the working properties of the common metals; selecting standard materials,
parts, and equipment required for this work; and fitting and assembling parts
into mechanical equipment. In general, the machinist's work normally requires
a rounded training in machine-shop practice usually acquired through a formal
apprenticeship or equivalent training and experience. Industrial machinery
repairer. Repairs machinery or mechanical equipment. Work involves most of
the following: examining machines and mechanical equipment to diagnose source
of trouble; dismantling or partly dismantling machines and performing repairs
that mainly involve the use of handtools in scraping and fitting parts;
replacing broken or defective parts with items obtained from stock; ordering
the production of a replacement part by a machine shop or sending the machine
to a machine shop for major repairs; preparing written specifications for
major repairs or for the production of parts ordered from machine shops;
reassembling machines; and making all necessary adjustments for operation.
In general, the work of a machinery maintenance mechanic requires rounded
training and experience usually acquired through a formal apprenticeship or
equivalent training and experience. Excluded from this classification are
workers whose primary duties involve setting up or adjusting machines. Vehicle
and mobile equipment mechanics and repairers. Repairs, rebuilds, or overhauls
major assemblies of internal combustion automobiles, buses, trucks, or
tractors. Work involves most of the following: Diagnosing the source of trouble
and determining the extent of repairs required; replacing worn or broken parts
such as piston rings, bearings, or other engine parts; grinding and adjusting
valves; rebuilding carburetors; overhauling transmissions; and repairing
fuel injection, lighting, and ignition systems. In general, the work of the
motor vehicle mechanic requires rounded training and experience usually
acquired through a formal apprenticeship or equivalent training and experience"

testJDs <- c(xTestAccountant, xTestAttorney, xTestMachinist)

# define the preprocessing (tolower case) function
preproc_fun = tolower

# define the tokenization function
token_fun = text2vec::word_tokenizer

# loop to substitute "_" with blank space
for (j in seq(job[, 10])) {
  job[j, 10] <- gsub("_", " ", job[j, 10])
}

# iterators for the job training data and the testing JDs
iter_Jobs = itoken(job[, 10],
                   preprocessor = preproc_fun,
                   tokenizer = token_fun,
                   progressbar = TRUE)
iter_testJDs = itoken(testJDs,
                      preprocessor = preproc_fun,
                      tokenizer = token_fun,
                      progressbar = TRUE)

jobs_Vocab = create_vocabulary(iter_Jobs, stopwords = tm::stopwords("english"),
                               ngram = c(1L, 2L))

jobsVectorizer = vocab_vectorizer(jobs_Vocab)

dtm_jobsTrain = create_dtm(iter_Jobs, jobsVectorizer)
dtm_testJDs = create_dtm(iter_testJDs, jobsVectorizer)

dim(dtm_jobsTrain); dim(dtm_testJDs)

##  [1]  200  2675 

##  [1]  3  2675 

set.seed(2)
fit1 <- cv.glmnet(x = as.matrix(dtm_jobsTrain), y = job[['highrank']],
                  family = 'binomial',
                  # lasso penalty
                  alpha = 1,
                  # interested in the area under ROC curve
                  type.measure = "auc",
                  # 10-fold cross-validation
                  nfolds = 10,
                  # high value is less accurate, but has faster training
                  thresh = 1e-3,
                  # again lower number of iterations for faster training
                  maxit = 1e3)

print(paste("max AUC =", round(max(fit1$cvm), 4)))

##  [1]  "max  AUC  =  0.7934" 

Note that the AUC improved somewhat, to about 0.79. Below, we will assess the JD
predictive model using the three out-of-bag job descriptions (Fig. 20.7).




Fig. 20.7 AUC-based performance of the cross-validated LASSO-regularized model of
job-ranking based on dtm_jobsTrain, see Figs. 20.5 and 20.6

plot(fit1)
# plot(fit1, xvar="lambda", label="TRUE")
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

predTestJDs <- predict(fit1, s = fit1$lambda.1se,
                       newx = dtm_testJDs, type = "response"); predTestJDs


##  1 
##  1  0.2153011 
##  2  0.2200925 
##  3  0.1257575 

predTrainJDs <- predict(fit1, s = fit1$lambda.1se, newx = dtm_jobsTrain,
                        type = "response"); predTrainJDs

##  1 
##  1  0.3050636 

##  2  0.4118190 

##  3  0.1288656 

##  4  0.1493051 

##  5  0.6432706 

##  6  0.1257575 

##  7  0.1257575 

##  8  0.2561290 

##  9  0.3866247 

##  10  0.1262752 

##  196  0.1257575 
##  197  0.1257575 
##  198  0.1257575 
##  199  0.1257575 
##  200  0.1257575 


# Type can be: "link", "response", "coefficients", "class", "nonzero"


The  output  of  the  predictions  shows  that: 

•  On  the  training  data ,  the  predicted  probabilities  rapidly  decrease  with  the 
indexing  of  the  jobs,  corresponding  to  the  overall  job  ranking  (highly  ranked/ 
desired  jobs  are  listed  on  the  top). 

•  On  the  three  testing  job  description  data  (accountant,  attorney,  and  machinist), 
there  is  a  clear  ranking  difference  between  the  machinist  and  the  other  two 
professions. 

Also see the discussion in Chap. 18 about the different types of predictions that
can be generated as outputs of cv.glmnet regularized forecasting methods.
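For a binomial model, the main prediction types are related by the logistic transformation. The base-R sketch below uses hypothetical log-odds values (they are not outputs of the fitted model) to illustrate how a "link" value maps to a "response" probability and then to a "class" label at the 0.5 cut-off.

```r
# Minimal sketch: how "link", "response" and "class" predictions relate
# in a logistic (binomial) model. The log-odds values are made up.
link_vals <- c(-1.94, -1.29, 0.59)          # linear predictor (log-odds)

response_vals <- plogis(link_vals)          # inverse logit: 1 / (1 + exp(-x))
class_vals    <- ifelse(response_vals > 0.5, 1, 0)   # label at a 0.5 threshold

print(data.frame(link = link_vals,
                 response = round(response_vals, 4),
                 class = class_vals))
```

A negative log-odds value always yields a probability below 0.5 and therefore the class 0, so the three representations carry the same ordering of cases.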


20.4  Cosine  Similarity

As we mentioned above, text data are often transformed in terms of Term Frequency-
Inverse Document Frequency (TF-IDF), which offers a better input than the raw
frequencies for many text-mining methods. An alternative transformation can be
based on a different measure such as the cosine similarity, which is defined
by:

$$similarity = \cos(\theta) = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where θ represents the angle between the pair of vectors A and B in the Euclidean
space spanned by the DTM matrix. The corresponding cosine distance is 1 − cos(θ) (Fig. 20.8).
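Before applying this measure to the whole DTM, a two-vector sketch may be useful. The helper cos_sim and the three toy term-count vectors below are hypothetical and use only base R.

```r
# Minimal sketch: cosine similarity and distance between toy term-count vectors.
# cos_sim and the vectors doc_a, doc_b, doc_c are illustrative, not from the text.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

doc_a <- c(1, 0, 2)   # term counts of a first document
doc_b <- c(2, 0, 4)   # same term proportions as doc_a, twice the length
doc_c <- c(0, 3, 0)   # shares no terms with doc_a

sim_ab <- cos_sim(doc_a, doc_b)   # parallel vectors
sim_ac <- cos_sim(doc_a, doc_c)   # orthogonal vectors

print(c(similarity_ab = sim_ab, distance_ab = 1 - sim_ab,
        similarity_ac = sim_ac, distance_ac = 1 - sim_ac))
```

Documents with the same term proportions have similarity 1 (distance 0) regardless of their length, while documents with no shared terms have similarity 0 (distance 1). This length invariance is one reason the measure is popular for comparing texts of different sizes.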


cos_dist = function(mat) {
  numer = tcrossprod(mat)
  denom1 = sqrt(apply(mat, 1, crossprod))
  denom2 = sqrt(apply(mat, 1, crossprod))
  1 - numer / outer(denom1, denom2)
}

dist_cos = cos_dist(as.matrix(dtm))
set.seed(2000)
fit_cos <- cv.glmnet(x = dist_cos, y = job[['highrank']],
                     family = 'binomial',
                     # lasso penalty
                     alpha = 1,
                     # interested in the area under ROC curve
                     type.measure = "auc",
                     # 10-fold cross-validation
                     nfolds = 10,
                     # high value is less accurate, but has faster training
                     thresh = 1e-3,
                     # again lower number of iterations for faster training
                     maxit = 1e3)

plot(fit_cos)


Fig. 20.8 AUC-based performance of the cross-validated LASSO-regularized model of
job-ranking based on cosine-similarity distance (dist_cos), see Figs. 20.5, 20.6, and 20.7


print(paste("max AUC =", round(max(fit_cos$cvm), 4)))

##  [1]  "max  AUC  =  0.8065" 

The  AUC  now  is  greater  than  0.8,  which  is  a  pretty  good  result;  even  better  than 
what  we  obtained  from  DTM  or  TF-IDF.  This  suggests  that  our  machine  “under¬ 
standing”  of  the  textual  content,  i.e.,  the  natural  language  processing,  leads  to  a  more 
acceptable  content  classifier. 


20.5  Sentiment  Analysis 

Let’s use the text2vec::movie_review dataset, which consists of 5,000
movie reviews dichotomized as positive or negative. In the subsequent
predictive analytics, this sentiment will represent our output feature:


negative 

positive 


20.5.1  Data  Preprocessing 

The data.table package will also be used for some data manipulation. Let’s start
with splitting the data into training and testing sets.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# load the movie reviews data
data("movie_review")

# coerce the movie reviews data to a data.table (DT) object
setDT(movie_review)

# create a key for the movie-reviews data table
setkey(movie_review, id)

# View the data
# View(movie_review)
head(movie_review); dim(movie_review); colnames(movie_review)
##         id sentiment
## 1: 10000_8         1
## 2: 10001_4         0
## 3: 10004_3         0
## 4: 10004_8         1
## 5: 10006_4         0
## 6: 10008_7         1
##
## review
## 1: Homelessness (or Houselessness as George Carlin stated) has been an
## issue for years but never a plan to help those on the street that were once
## considered human who did everything from going to school ... Maybe they should
## give it to the homeless instead of using it like Monopoly money.<br /><br />
## Or maybe this film will inspire you to help others.
## 2: This film lacked something I couldn't put my finger on at first: charisma
## on the part of the leading actress. This inevitably translated to lack of
## chemistry when she shared the screen with her leading man. Even the romantic
## scenes came across as being merely the actors at play ... I was disappointed in
## this movie. But, don't forget it was nominated for an Oscar, so judge for
## yourself.
## 3: \\"It appears that many critics find the idea of a Woody Allen drama
## unpalatable.\\" And for good reason: they are unbearably wooden and pretentious
## imitations of Bergman. And let's ... \\"ripping off\\" Hitchcock in his
## suspense/horror films? In Robin Wood's view, it's a strange form of cultural
## snobbery. I would have to agree with that.
## 4: This isn't the comedic Robin Williams, nor is it the quirky/insane Robin
## Williams of recent thriller fame. This is a hybrid of the classic drama without
## over-dramatization, mixed with Robin's new love of the thriller. But this
## isn't a thriller, per se ... <br /><br />All in all, it's worth a watch, though
## it's definitely not Friday/Saturday night fare.<br /><br />It rates a 7.7/10
## from...<br /><br />the Fiend :.
## 5: I don't know who to blame, the timid writers or the clueless director. It
## seemed to be one of those movies where so much was paid to the stars (Angie,
## Charlie, Denise, Rosanna and Jon) ... If they were only looking for laughs why
## not cast Whoopi Goldberg and Judy Tenuta instead? This was so predictable I
## was surprised to find that the director wasn't a five year old. What a waste,
## not just for the viewers but for the actors as well.
## 6: You know, Robin Williams, God bless him, is constantly shooting himself in
## the foot lately with all these dumb comedies he has done this decade (with
## perhaps the exception of \\"Death To Smoochy ... It's incredible that there is
## at least one woman in this world who is like this, and it's scary too.<br />
## <br />This is a good, dark film that I highly recommend. Be prepared to be
## unsettled, though, because this movie leaves you with a strange feeling at the
## end.

##  [1]  5000  3 

##  [1]  "id"  "sentiment"  "review" 

# Generate 80-20% training-testing split of the reviews
all_ids = movie_review$id
set.seed(1234)
train_ids = sample(all_ids, 5000 * 0.8)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[train_ids, ]
test = movie_review[test_ids, ]


Next,  we  will  vectorize  the  reviews  by  creating  terms  to  termID  mappings.  Note 
that  terms  may  include  arbitrary  n-grams ,  not  just  single  words.  The  set  of  reviews 
will  be  represented  as  a  sparse  matrix,  with  rows  and  columns  corresponding  to 
reviews/reviewers  and  terms,  respectively.  This  vectorization  may  be  accomplished 
in  several  alternative  ways,  e.g.,  by  using  the  corpus  vocabulary,  feature 
hashing,  etc. 

The vocabulary-based DTM, created by the create_vocabulary() function,
relies on all unique terms from all reviews, where each term has a unique
ID. In this example, we will create the review vocabulary using an iterator construct
abstracting the input details and enabling in-memory processing of the (training) data
by chunks.
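The term-to-ID mapping and the resulting document term matrix can be illustrated without any packages. The following base-R sketch uses three made-up mini-reviews. It is only meant to show the idea and is not how text2vec implements create_vocabulary() or create_dtm(), which rely on iterators and sparse matrices.

```r
# Minimal sketch: vocabulary (term -> ID) and a dense DTM in base R.
# toy_docs and all object names are illustrative, not from the text.
toy_docs <- c(d1 = "a good dark film", d2 = "a dumb film", d3 = "good good fun")

# tokenize: lower-case and split on white space
tokens <- strsplit(tolower(toy_docs), "\\s+")
names(tokens) <- names(toy_docs)

# vocabulary: every unique term gets an integer ID
vocab   <- sort(unique(unlist(tokens)))
term_id <- setNames(seq_along(vocab), vocab)

# DTM: one row per document, one column per term, entries are term counts
dtm_toy <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
colnames(dtm_toy) <- vocab

print(term_id)
print(dtm_toy)
```

Each row of dtm_toy sums to the number of tokens in the corresponding document. With a realistic vocabulary of tens of thousands of terms, almost all entries are zero, which is why a sparse representation is preferred in practice.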


# define the text preprocessing
# either a simple (tolower case) function
preproc_fun = tolower

# or a more elaborate "cleaning" function
preproc_fun = function(x) {                    # text data
  require("tm")
  x = gsub("<.*?>", " ", x)                    # regex removing HTML tags
  x = iconv(x, "latin1", "ASCII", sub = "")    # remove non-ASCII characters
  x = gsub("[^[:alnum:]]", " ", x)             # remove non-alpha-numeric values
  x = tolower(x)                               # convert to lower case characters
  # x = removeNumbers(x)                       # removing numbers
  x = stripWhitespace(x)                       # removing white space
  x = gsub("^\\s+|\\s+$", "", x)               # remove leading and trailing white space
  return(x)
}

# define the tokenization function
token_fun = word_tokenizer

# iterator for both training and testing sets
iter_train = itoken(train$review,
                    preprocessor = preproc_fun,
                    tokenizer = token_fun,
                    ids = train$id,
                    progressbar = TRUE)
iter_test = itoken(test$review,
                   preprocessor = preproc_fun,
                   tokenizer = token_fun,
                   ids = test$id,
                   progressbar = TRUE)

reviewVocab = create_vocabulary(iter_train)

# report the head and tail of the reviewVocab
reviewVocab

## Number of docs: 4000
## 0 stopwords:  ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
##              terms terms_counts doc_counts
##     1:     lowlife            1          1
##     2:       sorin            1          1
##     3:       ewell            1          1
##     4:  negligence            1          1
##     5:     stribor            1          1
##    ---
## 35661: landscaping            1          1
## 35662:       bikes            1          1
## 35663:      primer            1          1
## 35664:     loosely           26         25
## 35665:     cycling            1          1


Next,  we  can  compute  the  document  term  matrix  (DTM). 

reviewVectorizer = vocab_vectorizer(reviewVocab)
t0 = Sys.time()
dtm_train = create_dtm(iter_train, reviewVectorizer)
dtm_test = create_dtm(iter_test, reviewVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 3.844368 secs

# check the DTM dimensions
dim(dtm_train); dim(dtm_test)
## [1]  4000 35665
## [1]  1000 35665

# confirm that the training data review DTM dimensions are consistent
# with training review IDs, i.e., #rows = number of documents, and
# #columns = number of unique terms (n-grams), dim(dtm_train)[[2]]
identical(rownames(dtm_train), train$id)
## [1] TRUE
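
To make the vocabulary and DTM constructs concrete, here is a minimal base-R sketch on three made-up reviews. It does not use text2vec; the object names (reviews, vocab, dtm) and the toy texts are illustrative assumptions, and the cleaning steps are a simplified version of the preprocessing function defined earlier.

```r
# three hypothetical (made-up) reviews, named by document ID
reviews <- c(r1 = "A great, GREAT movie!",
             r2 = "Terrible plot; great cast.",
             r3 = "terrible movie")

# simplified cleaning: drop non-alphanumeric characters, lower-case, trim
clean  <- trimws(tolower(gsub("[^[:alnum:]]", " ", reviews)))
tokens <- strsplit(clean, "\\s+")          # one token vector per document

# vocabulary: all unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# DTM: rows = documents, columns = vocabulary terms, entries = term counts
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(names(reviews), vocab)
dtm
```

Each row of dtm sums to the number of tokens in the corresponding review, which mirrors the consistency check (rows = documents, columns = unique terms) performed above on the real data.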


20.5.2  NLP/TM  Analytics 

We can now fit statistical models or derive machine learning model-free predictions.
Let’s start by using glmnet() to fit a logit model with LASSO (L1) regularization
and 10-fold cross-validation, see Chap. 18 (Fig. 20.9).

Fig. 20.9 AUC-based performance of the cross-validated LASSO-regularized model of movie
sentiment analysis training data (dtm_train), see Fig. 20.8

library(glmnet)
nFolds = 10
t0 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']],
    family = "binomial",
    # LASSO L1 penalty
    alpha = 1,
    # interested in the area under ROC curve or MSE
    type.measure = "auc",
    # n-fold internal (training data) stats cross-validation
    nfolds = nFolds,
    # threshold: high value is less accurate / faster training
    thresh = 1e-2,
    # again lower number of iterations for faster training
    maxit = 1e3
)

lambda.best <- glmnet_classifier$lambda.min
lambda.best
## [1] 0.007344319

# report execution time
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 5.923289 secs

# some prediction plots
plot(glmnet_classifier)
# plot(glmnet_classifier, xvar="lambda", label="TRUE")
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

Now let’s look at external validation, i.e., testing the model on the independent
20% of the reviews we kept aside. The performance of the binary prediction (binary
sentiment analysis of these movie reviews) on the test data is roughly the same as we
had from the internal statistical 10-fold cross-validation.

# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))
## [1] "max AUC = 0.9246"

# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
# note: without type = 'response', predict() returns values on the link (log-odds) scale
predLASSO <- predict(glmnet_classifier,
    s = glmnet_classifier$lambda.1se, newx = xTest)
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 2.869523

# Binarize the LASSO probability prediction
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 455 152
##            1  40 353

# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.9175598

# report the top 20 negative and positive predictive terms
summary(predLASSO)
##        1
##  Min.   :-12.80392
##  1st Qu.: -0.94586
##  Median :  0.14755
##  Mean   : -0.07101
##  3rd Qu.:  1.01894
##  Max.   :  6.60888

sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] -4.752272 -2.199304 -1.987171 -1.966585 -1.902009 -1.866655 -1.844966
##  [8] -1.750693 -1.518148 -1.502966 -1.436081 -1.405963 -1.349566 -1.342856
## [15] -1.320218 -1.283414 -1.270231 -1.257663 -1.242869 -1.209449

rev(sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] 2.559487 2.416333 2.101371 1.913529 1.899846 1.684176 1.600367
##  [8] 1.530320 1.519663 1.435103 1.430446 1.376056 1.343108 1.309902
## [15] 1.300156 1.287921 1.131859 1.078685 1.074059 1.015887

The (external) prediction performance, measured by AUC, on the testing
data is about the same as the internal 10-fold stats cross-validation we reported above.

Fig. 20.10 AUC-based performance of the pruned cross-validated LASSO-regularized model of
movie sentiment analysis (glmnet_prunedClassifier), see Fig. 20.9


20.5.3  Prediction  Optimization 

Earlier,  we  saw  that  we  can  also  prune  the  vocabulary  and  perhaps  improve 
prediction  performance,  e.g.,  by  removing  non-salient  terms  like  stop  words  and  by 
using  n-grams  instead  of  single  words  (Fig.  20.10). 


reviewVocab = create_vocabulary(iter_train,
    stopwords=tm::stopwords("english"), ngram = c(1L, 2L))

prunedReviewVocab = prune_vocabulary(reviewVocab,
    term_count_min = 10,
    doc_proportion_max = 0.5,
    doc_proportion_min = 0.001)

prunedVectorizer = vocab_vectorizer(prunedReviewVocab)
t0 = Sys.time()
dtm_train = create_dtm(iter_train, prunedVectorizer)
dtm_test = create_dtm(iter_test, prunedVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 3.778152 secs

Next, let’s refit the model and report its performance. Would there be an
improvement in the prediction accuracy?

glmnet_prunedClassifier = cv.glmnet(x=dtm_train,
    y=train[['sentiment']],
    family = "binomial",
    # LASSO L1 penalty
    alpha = 1,
    # interested in the area under ROC curve or MSE
    type.measure = "auc",
    # n-fold internal (training data) stats cross-validation
    nfolds = nFolds,
    # threshold: high value is less accurate / faster training
    thresh = 1e-2,
    # again lower number of iterations for faster training
    maxit = 1e3
)

lambda.best <- glmnet_prunedClassifier$lambda.min
lambda.best
## [1] 0.005555708

# report execution time
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 6.978195 secs

# some prediction plots
plot(glmnet_prunedClassifier)
mtext("Pruned-Model CV LASSO: Number of Nonzero (Active) Coefficients",
    side=3, line=2.5)

# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_prunedClassifier$cvm), 4)))
## [1] "max AUC = 0.9295"

# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
predLASSO = predict(glmnet_prunedClassifier,
    dtm_test, type = 'response')[, 1]
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 0.1194583

# Binarize the LASSO probability prediction
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 416  60
##            1  79 445

# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.9252405

# report the top 20 negative and positive predictive terms
summary(predLASSO)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0000014 0.2518000 0.5176000 0.5026000 0.7604000 0.9995000

sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] -5.695082 -2.774694 -2.756099 -2.540456 -2.508213 -2.474586 -2.432767
##  [8] -2.429874 -1.999731 -1.941299 -1.934803 -1.929788 -1.819220 -1.774936
## [15] -1.765978 -1.737596 -1.717957 -1.661592 -1.611752 -1.599558

rev(sort(predict.cv.glmnet(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20]
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
##  [1] 3.276620 2.695083 2.575524 2.436630 2.366057 2.139067 2.087892
##  [8] 2.027113 1.980694 1.894909 1.839621 1.777573 1.743082 1.599660
## [15] 1.579711 1.569817 1.533461 1.509555 1.453862 1.425065

# Binarize the LASSO probability prediction
# and construct an approximate confusion matrix
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 416  60
##            1  79 445

Using n-grams slightly improved the sentiment prediction model.

Try applying these NLP techniques to other data, such as:

•  MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ,
Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and
Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from:
http://www.nature.com/articles/sdata201635

•  Other data from the list of our Case-Studies.

•  Your own free text.


20.6  Assignment:  20.  Natural  Language  Processing/Text 
Mining 

20.6.1  Mining  Twitter  Data 

Use  these  R  Data  Mining  Twitter  data  to  apply  NLP/TM  methods  and  investigate  the 
Twitter  corpus. 

•  Construct a VCorpus object

•  Clean the VCorpus object

•  Build the document term matrix (DTM)

•  Compute the TF-IDF (term frequency - inverse document frequency)

•  Use the DTM to construct a wordcloud.


20.6.2  Mining  Cancer  Clinical  Notes 

Use  Head  and  Neck  Cancer  Medication  Data  to  apply  NLP/TM  methods  and 
investigate  the  corpus.  You  have  already  seen  this  data  in  Chap.  8;  now  we  can  go 
a  step  further. 

•  Use MEDICATION_SUMMARY to construct a VCorpus object.

•  Clean  the  VCorpus  object. 

•  Build  the  document  term  matrix  (DTM). 

•  Add  a  column  to  indicate  early  and  later  stage  according  to  seer_stage  (refer 
to  Chap.  8). 

•  Use  the  DTM  to  construct  a  word  cloud  for  early  stage,  later  stage  and  whole. 

•  Interpret  according  to  the  word  cloud. 

•  Compute  the  TF-IDF  (Term  Frequency  -  Inverse  Document  Frequency). 

•  Apply  LASSO  on  the  unweighted  and  weighted  DTM  respectively  and  evaluate 
the  results  according  to  AUC. 

•  Try  cosine  similarity  transformation,  apply  LASSO  and  compare  the  result. 

•  Use other measures such as “class” for cv.glmnet().

•  Does it appear that these classifiers understand human language well?

Chapter 21

Prediction and Internal Statistical Cross Validation

We should start by reviewing Chap. 14 (Model Performance Assessment). Cross-validation
is a statistical approach for validating predictive methods, classification
models, and clustering techniques. It assesses the reliability and stability of the
results of the corresponding statistical analyses (e.g., predictions, classifications,
forecasts) based on independent datasets. For prediction of trend, association,
clustering, and classification, a model is usually trained on one dataset (training
data) and subsequently tested on new data (testing or validation data). Statistical
internal cross-validation uses iterative bootstrapping to define test datasets, evaluates
the model predictive performance, and assesses its power to avoid overfitting.
Overfitting is the process of computing a predictive or classification model that
describes random error, i.e., fits to the noise components of the observations, instead
of the actual underlying relationships and salient features in the data.

In this Chapter, we will use the Google Flu Trends, Autism, and Parkinson’s
disease case-studies to (1) illustrate exhaustive and non-exhaustive internal statistical
cross-validation; (2) explore alternative forecasting types using linear and non-linear
predictions; and (3) compare complementary predictor functions.


21.1  Forecasting  Types  and  Assessment  Approaches 

In Chap. 7, we discussed the types of classification and prediction methods, including
supervised and unsupervised learning. The former are direct and predictive,
as there are known outcome variables that can be predicted, and the
corresponding forecasts can be evaluated. The latter are indirect and descriptive, as
there are no a priori labels or specific outcomes.

There are alternative metrics used for evaluation of model performance, see
Chap. 14. For example, assessment of supervised prediction and classification
methods depends on the type of the labeled outcome responses: categorical
(binary or polytomous) vs. continuous.

•  Confusion matrices reporting accuracy, FP, FN, PPV, NPV, LOR and
other metrics may be used to assess predictions of dichotomous (binary) or
polytomous outcomes.

•  R², correlations (between predicted and observed outcomes), and RMSE measures
may be used to quantify the performance of various supervised forecasting
methods on continuous features.
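
As a small illustration of the two families of metrics listed above, the following base-R sketch computes confusion-matrix summaries for a binary outcome and R², correlation, and RMSE for a continuous outcome. All vectors are made-up toy values and the variable names are assumptions chosen for this example.

```r
# --- categorical (binary) outcome: hypothetical labels, 1 = positive, 0 = negative
observed  <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)

TP <- sum(predicted == 1 & observed == 1)   # true positives
FP <- sum(predicted == 1 & observed == 0)   # false positives
FN <- sum(predicted == 0 & observed == 1)   # false negatives
TN <- sum(predicted == 0 & observed == 0)   # true negatives

accuracy <- (TP + TN) / length(observed)
PPV <- TP / (TP + FP)                       # positive predictive value
NPV <- TN / (TN + FN)                       # negative predictive value
LOR <- log((TP * TN) / (FP * FN))           # log odds ratio
table(predicted, observed)

# --- continuous outcome: hypothetical observed and predicted values
y     <- c(2.0, 3.5, 5.1, 6.8, 9.2)
y_hat <- c(2.2, 3.1, 5.0, 7.4, 8.9)

rmse <- sqrt(mean((y - y_hat)^2))
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
rho  <- cor(y, y_hat)
c(RMSE = rmse, R2 = r2, correlation = rho)
```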


21.2  Overfitting 

Before  we  go  into  the  cross-validation  of  predictive  analytics,  we  will  present  several 
examples  of  overfitting  that  illustrate  why  a  certain  amount  of  skepticism  and 
mistrust  may  be  appropriate  when  dealing  with  forecasting  models  that  are  based 
on  large  and  complex  data. 


21.2.1  Example  (US  Presidential  Elections ) 

By 2017, there had been only 58 US presidential elections and 45 presidencies. That is a
small dataset, and learning from it may be challenging. For instance:

•  If  the  predictor  space  expands  to  include  things  like  having  false  teeth ,  it’s  pretty 
easy  for  the  model  to  go  from  fitting  the  generalizable  features  of  the  data  (the 
signal,  e.g.,  presidential  actions)  to  matching  noise  patterns  (e.g.,  irrelevant 
characteristics  like  gender  of  the  children  of  presidents,  or  types  of  dentures 
they  may  wear). 

•  When overfitting noise patterns takes place, the quality of the model fit assessed
on the historical data may improve (e.g., better R²; more about the Coefficient of
Determination is available online). At the same time, however, the model performance
may be suboptimal when used to make inference about prospective data,
e.g., future presidential elections.

Figure  21.1  shows  a  cartoon  that  includes  some  of  the  (unique)  noisy  presidential 
characteristics  that  are  thought  to  be  unimportant  to  electability,  fitness  for  office,  or 
expectations  of  presidential  performance. 


21.2.2  Example  (Google  Flu  Trends ) 

A March 14, 2014 article in Science (DOI: https://doi.org/10.1126/science.1248506)
identified problems in a Google Flu Trends (GFT) study (DOI:
https://doi.org/10.1371/journal.pone.0023610), which may be attributed in part to
overfitting. The GFT model was built to predict the future Centers for Disease
Control and Prevention (CDC) reports of doctor office visits for influenza-like illness
(ILI). In February 2013, Nature reported that GFT was predicting more than double
the proportion of doctor visits compared to the CDC forecast for the same period.

Fig. 21.1 Example of an overfitting based on extreme stratification of traits of presidential
candidates

The GFT model found the best matches among 50 million web search terms to fit
1,152 data points. With so many candidate terms and so few observations, the odds
were quite high of finding search terms that match the propensity of the flu but are
structurally unrelated to it, and hence are not prospectively predictive. In fact, the
GFT investigators reported weeding out seasonal search terms that were unrelated to
the flu but strongly correlated to the CDC data, e.g., high school basketball season.
The big GFT data may have overfitted the relatively small number of cases. This
false-alarm result was also paired with a false-negative finding: the GFT model missed
the non-seasonal 2009 H1N1 influenza pandemic. This provides a cautionary tale
about prediction, overfitting, and prospective validation.
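
The "many candidate predictors, few data points" problem can be sketched with pure noise. The simulation below is a scaled-down toy (5,000 random predictors and 30 observations, both arbitrary choices for this illustration, not the GFT data): the outcome and all predictors are independent random numbers, yet the best-matching predictor still appears strongly correlated with the outcome.

```r
set.seed(2014)
n_obs  <- 30      # number of observations (assumed, for illustration)
n_pred <- 5000    # number of candidate predictors (assumed, for illustration)

y <- rnorm(n_obs)                                   # outcome: pure noise
X <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)    # predictors: pure noise

cors <- as.vector(cor(X, y))     # correlation of every predictor with y
best <- which.max(abs(cors))     # index of the best in-sample match
max(abs(cors))                   # looks like a strong association

# prospective check: fresh draws of the same (unrelated) processes
y_new <- rnorm(n_obs)
x_new <- rnorm(n_obs)
cor(x_new, y_new)                # no systematic relationship to carry over
```

Selecting the best of thousands of unrelated predictors on a small sample yields an impressive in-sample fit that should not be expected to persist on new data.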


21.2.3  Example  ( Autism ) 

Autistic  brains  constantly  overfit  visual  and  cognitive  stimuli.  To  an  autistic 
person,  a  general  conversation  of  several  adults  may  seem  like  a  cacophony  due 
to  super-sensitive  detail-oriented  hearing  and  perception  tuned  to  literally  pick  up 
all  elements  of  the  conversation  and  clues  of  the  surrounding  environment.  At  the 
same  time,  autistic  brains  may  downplay  body  language,  sarcasm,  and  non-literal 
cues. We can miss the forest for the trees when we start “overfitting”,
i.e., when we over-interpret the noise on top of the actual salient information.
Ambient  noise,  trivial  observations,  and  unrelated  perceptions  may  obfuscate  the 
true  communication  details. 

Human  conversations  and  communications  involve  exchanges  of  both  critical 
information  and  random  noise.  Fitting  a  perfect  model  requires  focus  only  on  the 
“relevant”  information.  Overfitting  occurs  when  attention  is  (excessively)  consumed 
with  peripheral  noise,  or  worse,  overwhelmed  by  inconsequential  noise  drowning 
the  salient  aspects  of  the  communication  exchange. 

Any  dataset  is  a  mix  of  signal  and  noise.  The  main  task  of  our  brains  is  to  sort 
these  components  and  interpret  the  useful  information  while  ignoring  the  noise. 
However,  we  should  be  cognizant  that 

"One  person's  noise  is  another  person's  treasure  map!" 

Our predictions are most accurate if we can model as much of the signal and as
little of the noise as possible. Note that in these terms, R² is a poor metric to identify
predictive power - it measures how much of the signal and the noise is explained by
our model. In practice, it’s hard to always identify what’s signal and what’s noise.
This  is  why  practical  applications  tend  to  favor  simpler  models,  since  the  more 
complicated  a  model  is,  the  easier  it  is  to  overfit  the  noise  component  of  the  observed 
information. 
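
A minimal sketch of this trade-off, assuming a simulated sine-shaped signal with Gaussian noise (all settings below are arbitrary illustrative choices): polynomial models of increasing degree are fit on one half of the data and evaluated on the other half.

```r
set.seed(21)
n <- 60
x <- runif(n, -1, 1)
y <- sin(pi * x) + rnorm(n, sd = 0.3)     # signal + noise (simulated)

train <- data.frame(x = x[1:30],  y = y[1:30])
test  <- data.frame(x = x[31:60], y = y[31:60])

degrees <- c(1, 3, 12)                    # simple, moderate, very flexible
res <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree     = d,
    train_rmse = sqrt(mean(residuals(fit)^2)),
    test_rmse  = sqrt(mean((test$y - predict(fit, newdata = test))^2)))
}))
res
```

Because the models are nested, the training RMSE can only decrease as the degree grows. The testing RMSE has no such guarantee, and for very flexible models it typically grows again, which is the signature of fitting the noise component.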


21.3 Internal Statistical Cross-Validation is an Iterative Process

Internal statistical cross-validation assesses the expected performance of a prediction
method in cases (e.g., subjects, units, regions, etc.) drawn from a similar population
as the original training data sample. Internal validation is distinct from external
validation, as the latter potentially allows for the existence of differences
between the populations: training data, used to develop, or train, the technique,
and testing data, used to independently quantify the performance of the technique.
Each  step  in  the  internal  statistical  cross-validation  protocol  involves: 

•  Randomly  partitioning  a  sample  of  data  into  2  complementary  subsets  (training  + 
testing), 

•  Performing  the  analysis,  fitting  or  estimating  the  model  using  the  training  set, 

•  Validating  the  analysis  or  evaluating  the  performance  of  the  model  using  the 
separate  testing  set, 

•  Increasing the iteration index and repeating the process. Various termination
criteria can be chosen, such as a fixed number of iterations, a desired mean variability,
or an upper bound on the error-rate.
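
The steps above can be sketched as a simple loop. This is a minimal illustration on simulated data with a linear model; the 80/20 split, the 50 iterations (a fixed-number termination criterion), and all object names are assumptions made for this example.

```r
set.seed(1234)
n   <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)          # simulated data, noise SD = 1

n_iter    <- 50                            # termination: fixed number of iterations
test_rmse <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # 1. random partition into complementary training and testing subsets
  test_idx <- sample(n, size = round(0.2 * n))
  # 2. fit the model on the training set only
  fit <- lm(y ~ x, data = dat[-test_idx, ])
  # 3. evaluate performance on the separate testing set
  pred <- predict(fit, newdata = dat[test_idx, ])
  test_rmse[i] <- sqrt(mean((dat$y[test_idx] - pred)^2))
}

# aggregate the per-iteration results
c(mean_RMSE = mean(test_rmse), sd_RMSE = sd(test_rmse))
```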

One  example  of  internal  statistical  cross-validation  used  for  predictive  diagnostic 
modeling  in  Parkinson’s  disease  is  available  online. 

To  reduce  the  noise  and  variability  at  each  iteration,  the  final  validation  results 
may  include  the  averaged  performance  results  across  iterations. 

In  cases  when  new  observations  are  hard  to  obtain  (due  to  costs,  reliability,  time, 
or  other  constraints),  cross-validation  guards  against  testing  hypotheses  suggested 
by  the  data  themselves  (also  known  as  Type  III  error  or  False-Suggestion). 

Cross-validation is different from conventional validation (e.g., a single 80-20%
partitioning of the data set into training and testing subsets). Note that the prediction
error (e.g., Root Mean Square Error, RMSE) evaluated on the training data is not a
useful estimator of model performance, as it does not generalize across multiple
samples.

In general, the errors of conventional validation are based on the results of
a specific test dataset and may not accurately represent the model performance.
A more appropriate strategy to properly estimate model prediction performance is to
use cross-validation (CV), which combines (e.g., averages) multiple prediction
errors to measure the expected model performance. CV corrects for the expected
stochastic nature of partitioning the training and testing sets and generates a more
accurate and robust estimate of the expected model performance.

Relative to a simpler model, a more complex model may overfit the data if it
has a short foresight, i.e., it may generate accurate fitting results for known data but
less accurate results when predicting based on new data. Knowledge from past
experiences may include either relevant or irrelevant (noise) information. In challenging
data-driven prediction models where the uncertainty (entropy) is high, more
noise is present in past information that needs to be accounted for in prospective
forecasting. However, it is generally hard to discriminate patterns from noise in
complex systems, which makes it difficult to decide what part to model and what to
ignore. Models that reduce the chance of fitting noise are called robust.


21.4  Example  (Linear  Regression) 

Let’s demonstrate a simple model assessment using linear regression. Suppose
we observe the response values $\{y_1, \cdots, y_n\}$ and the corresponding
$k$ predictors represented as $k$-dimensional vectors of covariates $\{x_1, \cdots, x_n\}$,
with $x_i = (x_{i,1}, \cdots, x_{i,k})$, where subjects/cases are indexed by $1 \le i \le n$,
and the data-elements (variables) are indexed by $1 \le j \le k$.

Using least squares to estimate the linear function parameters (effect-sizes),
$\beta_1, \cdots, \beta_k$, allows us to compute a hyperplane $y = \alpha + x\beta$ that best
fits the observed data $\{(x_i, y_i)\}_{1 \le i \le n}$. This is expressed in matrix form by:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} \alpha \\ \vdots \\ \alpha \end{pmatrix} +
\begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$$

This corresponds to the system of linear equations:

$$\begin{cases}
y_1 = \alpha + x_{1,1}\beta_1 + x_{1,2}\beta_2 + \cdots + x_{1,k}\beta_k \\
y_2 = \alpha + x_{2,1}\beta_1 + x_{2,2}\beta_2 + \cdots + x_{2,k}\beta_k \\
\quad \vdots \\
y_n = \alpha + x_{n,1}\beta_1 + x_{n,2}\beta_2 + \cdots + x_{n,k}\beta_k
\end{cases}$$

One measure to evaluate the model fit may be the mean squared error (MSE).
The MSE for a given value of the parameters $\alpha$ and $\beta$ on the observed training
data $\{(x_i, y_i)\}_{1 \le i \le n}$ is expressed as:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i -
\underbrace{(\alpha + x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,k}\beta_k)}_{\text{predicted value } \hat{y}_i \text{ at } x_{i,1}, \cdots, x_{i,k}}\Big)^2$$

And the corresponding root mean square error (RMSE) is:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\Big(y_i -
\underbrace{(\alpha + x_{i,1}\beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,k}\beta_k)}_{\text{predicted value } \hat{y}_i \text{ at } x_{i,1}, \cdots, x_{i,k}}\Big)^2}$$

In the linear model case, the expected value of the MSE (over the distribution of
training sets) for the training set is $\frac{n-k-1}{n+k+1} E$, where $E$ is the expected
value of the MSE for the testing/validation data. Therefore, by fitting a model and
computing the MSE on the training set, we may produce an overly optimistic assessment
(smaller RMSE) of how well the model may fit another dataset. This biased value is an
in-sample estimate of the fit, whereas we are interested in the cross-validation
estimate as an out-of-sample estimate.
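
The optimism of the in-sample MSE can be checked by simulation. The sketch below assumes a correctly specified linear model with $k$ predictors plus an intercept, and a validation set whose responses are drawn at the same predictor values as the training set (a fixed-design assumption). Under these assumptions the classical reference ratio of expected training to expected validation MSE is $(n-k-1)/(n+k+1)$; the sample sizes and number of replicates are arbitrary.

```r
set.seed(75)
n <- 40; k <- 5; n_sim <- 500
beta <- rep(1, k)

mse <- t(replicate(n_sim, {
  X    <- matrix(rnorm(n * k), n, k)
  mu   <- drop(1 + X %*% beta)        # true mean response at these predictors
  y_tr <- mu + rnorm(n)               # training responses
  y_te <- mu + rnorm(n)               # validation responses (same X, new noise)
  fit  <- lm.fit(cbind(1, X), y_tr)   # least squares with intercept
  y_hat <- drop(cbind(1, X) %*% fit$coefficients)
  c(train = mean((y_tr - y_hat)^2),
    test  = mean((y_te - y_hat)^2))
}))

colMeans(mse)                          # in-sample MSE is smaller on average
ratio <- mean(mse[, "train"]) / mean(mse[, "test"])
c(observed = ratio, reference = (n - k - 1) / (n + k + 1))
```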

In  the  linear  regression  model,  cross  validation  may  not  be  as  useful,  since  we  can 
compute  the  exact  correction  factor  to  obtain  an  estimate  of  the  (unknown) 
exact  expected  out-of-sample  fit  using  the  (known)  in-sample  MSE  (under)estimate. 
However,  even  in  this  situation,  cross-validation  remains  useful  as  it  can  be  used  to 
select  an  optimal  regularized  cost  function. 

In most other modeling procedures (e.g., logistic regression), there are no simple
general closed-form expressions (formulas) to adjust the cross-validation error
estimate of the known in-sample fit to estimate the unknown out-of-sample error
rate. Cross-validation is a general strategy to predict the performance of a model on a
validation set using stochastic computation instead of obtaining experimental,
theoretical, mathematical, or closed-form analytic error estimates.


21.4.1  Cross-Validation  Methods 

There are two classes of cross-validation approaches, exhaustive and
non-exhaustive.


21.4.2  Exhaustive  Cross-Validation 

Exhaustive cross-validation methods are based on determining all possible ways to
divide the original sample into training and testing data. For instance, the leave-m-out
cross-validation involves using $m$ observations for testing and the remaining
$(n - m)$ observations as training. The case when $m = 1$, i.e., leave-one-out method, is
only applicable when $n$ is small, due to its huge computational cost. This process is
repeated on all partitions of the original sample. This method requires model fitting
and validating $\binom{n}{m}$ times ($n$ is the total number of observations in the original
sample and $m$ is the number of observations left out for validation). This requires a
very large number of iterations.

21.4.3 Non-Exhaustive Cross-Validation

Non-exhaustive cross-validation methods use bootstrap approximation to avoid
computing estimates/errors using all possible partitionings of the original sample.
For example, in the k-fold cross-validation, the original sample is randomly
partitioned into k equal-sized subsamples, or folds. Of all k subsamples, a single
subsample is kept as final testing data for validation of the model. The other k - 1
subsamples are used as training data. The cross-validation process is then repeated
k times, corresponding to the k folds. Each of the k subsamples is used once as the
validation data. In the end, the corresponding k results are averaged (or otherwise
aggregated) to generate a final pooled model-quality estimation. In k-fold validation,
all observations are used for both training and validation, and each observation is
used for validation exactly once. In general, k is a parameter that needs to be selected
by the investigator (common values may be 5 or 10).

A special case of the k-fold validation is k = n (the total number of
observations), when it coincides with the leave-one-out cross-validation.
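
A minimal base-R sketch of k-fold cross-validation for a linear model follows. The helper kfold_rmse() is a hypothetical function written for this illustration (it is not part of any package), the data are simulated, and the squared errors are pooled over all held-out predictions, which equals the average of the fold-wise MSEs when the folds have equal size.

```r
set.seed(2018)
n   <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 3 - 2 * dat$x + rnorm(n)          # simulated data

# hypothetical helper: k-fold cross-validated RMSE of the model y ~ x
kfold_rmse <- function(data, k) {
  # randomly assign each observation to one of k (nearly) equal-sized folds
  fold   <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(nrow(data))
  for (f in seq_len(k)) {
    held_out <- fold == f
    fit <- lm(y ~ x, data = data[!held_out, ])        # train on k-1 folds
    pred <- predict(fit, newdata = data[held_out, ])  # validate on fold f
    sq_err[held_out] <- (data$y[held_out] - pred)^2
  }
  sqrt(mean(sq_err))    # each observation is validated exactly once
}

cv5   <- kfold_rmse(dat, k = 5)            # 5-fold cross-validation
loocv <- kfold_rmse(dat, k = nrow(dat))    # k = n: leave-one-out
c(CV5 = cv5, LOOCV = loocv)
```

With k = n every fold holds a single observation, so the fold assignment no longer depends on the random seed and the result is deterministic.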

A  variation  of  the  k-fold  validation  is  stratified  k-fold  cross-validation, 
where  each  fold  has  (approximately)  the  same  mean  response  value.  For  instance, 
if  the  model  represents  a  binary  classification  of  cases  (e.g.,  controls  vs.  patients), 
this  implies  that  each  fold  contains  roughly  the  same  proportion  of  the  two  class 
labels. 
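
Stratified fold assignment can be sketched by randomizing the fold labels separately within each class. The class sizes below (70 controls, 30 patients) and the choice k = 5 are made-up values for illustration.

```r
set.seed(10)
label <- factor(c(rep("control", 70), rep("patient", 30)))  # hypothetical cohort
k     <- 5

fold <- integer(length(label))
for (cl in levels(label)) {
  idx <- which(label == cl)
  # spread the members of this class evenly (and randomly) over the k folds
  fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

tab <- table(fold, label)
tab                               # class counts per fold
prop.table(tab, margin = 1)       # class proportions per fold
```

Every fold then contains roughly the same proportion of the two class labels as the full sample.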

Repeated random sub-sampling validation randomly splits the entire dataset
into a training set, where the model is fit, and a testing set, where the predictive
accuracy is assessed. Again, the results are averaged over all iterative splits. This
method has an advantage over k-fold cross-validation, as the proportion of the
training/testing split is not dependent on the number of iterations (folds). However,
its drawback is that some observations may never be selected in the testing/validation
subsample, whereas others may be selected multiple times. As validation subsets
may overlap, the results may vary each time we repeat the validation protocol,
unless we set a seed point in the algorithm.

Asymptotically, as the number of random splits increases, the repeated random
sub-sampling validation approaches the leave-k-out cross-validation.


21.5  Case-Studies 

In the examples below, we have intentionally suppressed some of the R output
to save space. This is accomplished using the Rmarkdown chunk option
{r eval=TRUE, results='hide'}; however, the reader is encouraged
to try all the protocols hands-on, to make modifications, to inspect, and finally to
interpret the outputs.

21.5.1 Example 1: Prediction of Parkinson’s Disease Using
Adaptive Boosting (AdaBoost)


This Parkinson’s Disease study involves heterogeneous neuroimaging,
genetics, clinical, and phenotypic data for over 600 volunteers, including multivariate
data for three cohorts (HC = Healthy Controls, PD = Parkinson’s, SWEDD = subjects
without evidence for dopaminergic deficit).

# update packages
# update.packages()

Load the data: 06_PPMI_ClassificationValidationData_Short.csv.

ppmi_data <-
    read.csv("https://umich.instructure.com/files/330400/download?download_frd=1",
    header=TRUE)

Binarize the Dx (clinical diagnosis) classes.

# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control",
    "Control", "Patient")
attach(ppmi_data)

head(ppmi_data)
# View(ppmi_data)

Obtain model-free predictive analytics, e.g., AdaBoost classification, and report
the results.

# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
require(crossval)
require(ada)

# set up the AdaBoost prediction function
# Define a new AdaBoost classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula){
  ada.fit <- ada(train.x, train.y)
  predict.y <- predict(ada.fit, test.x)
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control").
  return (out)
}


When group sizes are imbalanced, we may need to rebalance them to avoid
potential biases of the dominant cohorts. In this case, we will rebalance the groups
using SMOTE (Synthetic Minority Oversampling Technique), as implemented in the
R package unbalanced. SMOTE may be used to handle class imbalance in binary
classification, see Chap. 3.
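The core idea behind SMOTE can be sketched in base R. The helper smote_sketch() below is hypothetical and is not the code used by the unbalanced package: it assumes numeric features, Euclidean distance, and k = 5 neighbours, and it only generates synthetic minority cases (no under-sampling of the majority class).

```r
# Minimal SMOTE-style sketch: each synthetic case is a random point on the
# segment joining a minority observation and one of its k nearest minority neighbours.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  D <- as.matrix(dist(X_min))            # pairwise Euclidean distances
  diag(D) <- Inf                         # a point is not its own neighbour
  synth <- matrix(NA_real_, n_new, ncol(X_min))
  for (j in seq_len(n_new)) {
    i   <- sample(nrow(X_min), 1)        # random minority case
    nn  <- order(D[i, ])[seq_len(k)]     # its k nearest minority neighbours
    nb  <- nn[sample(length(nn), 1)]     # pick one neighbour at random
    gap <- runif(1)                      # position along the segment
    synth[j, ] <- X_min[i, ] + gap * (X_min[nb, ] - X_min[i, ])
  }
  colnames(synth) <- colnames(X_min)
  synth
}

set.seed(1)
# made-up minority sample: 10 cases, 2 features
minority  <- matrix(rnorm(20), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
synthetic <- smote_sketch(minority, n_new = 30)
dim(synthetic)
```

Because every synthetic case is a convex combination of two observed minority cases, the new points stay within the range of the observed minority data.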
# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class imbalance
# in binary classification.
set.seed(1000)
# install.packages("unbalanced") to deal with unbalanced group data
require(unbalanced)

ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
uniqueID <- unique(ppmi_data$FID_IID)
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)

colnames(ppmi_data)
# ppmi_data.1 <- ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD

# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove Default Real Clinical subject classifications!
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD",
    "X", "FID_IID"))]
# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)
c(dim(input), length(output))   # output is a factor, so use length() rather than dim()

# balance the dataset
set.seed(123)
data.1 <- ubBalance(X=input, Y=output, type="ubSMOTE", percOver=300,
    percUnder=150, verbose=TRUE)
balancedData <- cbind(data.1$X, data.1$Y)
table(data.1$Y)

nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)

colnames(balancedData) <- c(colnames(input), "PD")

Next,  we’ll  check  the  re-balanced  cohort  sizes  (Fig.  21.2). 


Fig.  21.2  Quantile-quantile  plot  of  the  original  and  rebalanced  data  distributions  for  one  feature 
### Check balance
## T test
alpha.0.05 <- 0.05
test.results.bin <- NULL   # binarized/dichotomized p-values
test.results.raw <- NULL   # raw p-values

# get a better error-handling t.test function that gracefully handles NA's
# and trivial variances
my.t.test.p.value <- function(input1, input2) {
  obj <- try(t.test(input1, input2), silent=TRUE)
  if (inherits(obj, "try-error"))
    return(NA)
  else
    return(obj$p.value)
}

for (i in 1:ncol(balancedData))
{
  test.results.raw[i] <- my.t.test.p.value(input[, i], balancedData[, i])
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  # binarize the p-value (0=significant, 1=otherwise)
  print(c("i=", i, "var=", colnames(balancedData[i]), "t-test_raw_p_value=",
      test.results.raw[i]))
}

# we can also employ (e.g., FDR, Bonferroni) correction for multiple testing!
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr",
#     n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH",
#     "BY", "fdr", "none"
# plot(test.results.raw, test.results.corr)
# sum(test.results.raw < alpha.0.05, na.rm=T)/length(test.results.raw)
#     check proportion of inconsistencies
# sum(test.results.corr < alpha.0.05, na.rm=T)/length(test.results.corr)

qqplot(input[, 5], balancedData[, 5])   # check visually for differences
# between the distributions of the raw (input) and rebalanced data (for only
# one variable, in this case)


# Now, check visually for differences between the distributions of the raw
# (input) and rebalanced data.
# par(mar=c(1, 1, 1, 1))
# par(mfrow=c(10, 10))
# for(i in c(1:62, 64:101)){ qqplot(balancedData[, i], input[, i]) }   # except VisitID

# as the sample-size is changed:
length(input[, 5]); length(balancedData[, 5])
# to plot raw vs. rebalanced pairs (e.g., var="L_insular_cortex_Volume"), we
# need to equalize the lengths
# plot(input[, 5] + 0*balancedData[, 5], balancedData[, 5])
#     [, 5] == "L_insular_cortex_Volume"

# print(c("T-test results: ", test.results))
# zeros (0) are significant independent between-group T-test differences,
# ones (1) are insignificant


for (i in 1:(ncol(balancedData)-1))
{
  test.results.raw[i] <- wilcox.test(input[, i], balancedData[, i])$p.value
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  print(c("i=", i, "Wilcoxon-test=", test.results.raw[i]))
}
print(c("Wilcoxon test results: ", test.results.bin))

# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr",
#     n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#     "fdr", "none"
# plot(test.results.raw, test.results.corr)
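The commented-out multiple-testing correction can be tried on its own. The sketch below uses made-up p-values (not the PPMI results) and the base-R function stats::p.adjust to contrast Bonferroni and FDR adjustments.

```r
# Illustrative p-values (made up for this sketch, not taken from the PPMI data)
p_raw <- c(0.001, 0.008, 0.020, 0.040, 0.300, 0.700)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # family-wise error control
p_fdr  <- p.adjust(p_raw, method = "fdr")         # Benjamini-Hochberg false discovery rate

cbind(raw = p_raw, bonferroni = p_bonf, fdr = p_fdr)

# proportion of tests declared significant before and after correction
alpha <- 0.05
c(raw = mean(p_raw < alpha),
  bonferroni = mean(p_bonf < alpha),
  fdr = mean(p_fdr < alpha))
```

Adjusted p-values are never smaller than the raw ones, so fewer features are flagged as differing between the raw and rebalanced data after correction.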

The  next  step  will  be  the  actual  cross-validation. 

# using raw data:
X <- as.data.frame(input); Y <- output
neg <- "1"   # "Control" == "1"

# using Rebalanced data:
X <- as.data.frame(data.1$X); Y <- data.1$Y
# balancedData <- cbind(data.1$X, data.1$Y); dim(balancedData)

# Side note: There is a function name collision for "crossval", the same
# method is present in the "mlr" (machine learning in R) package and in the
# "crossval" package.
# To specify a function call from a specific package do: packagename::functionname()

set.seed(115)
cv.out <- crossval::crossval(my.ada, X, Y, K = 5, B = 1, negative = neg)
# the label of a negative "null" sample (default: "control")
out <- diagnosticErrors(cv.out$stat)

print(cv.out$stat)
##    FP    TP    TN    FN
##   0.6 109.6  97.0   0.2

print(out)
##        acc       sens       spec        ppv        npv        lor
##  0.9961427  0.9981785  0.9938525  0.9945554  0.9979424 11.3918119

As  we  can  see  from  the  reported  metrics,  the  overall  averaged  AdaBoost-based 
diagnostic  predictions  are  quite  good. 
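The reported metrics follow directly from the averaged confusion counts printed by cv.out$stat above. The sketch below recomputes them with the standard definitions; diag_metrics() is a hypothetical helper written for illustration and is not part of the crossval package.

```r
# Averaged confusion counts as printed by cv.out$stat above
counts <- c(FP = 0.6, TP = 109.6, TN = 97.0, FN = 0.2)

# Hypothetical helper using the standard definitions of each metric
diag_metrics <- function(s) {
  c(acc  = (s[["TP"]] + s[["TN"]]) / sum(s),             # accuracy
    sens = s[["TP"]] / (s[["TP"]] + s[["FN"]]),          # sensitivity
    spec = s[["TN"]] / (s[["TN"]] + s[["FP"]]),          # specificity
    ppv  = s[["TP"]] / (s[["TP"]] + s[["FP"]]),          # positive predictive value
    npv  = s[["TN"]] / (s[["TN"]] + s[["FN"]]),          # negative predictive value
    lor  = log((s[["TP"]] * s[["TN"]]) / (s[["FP"]] * s[["FN"]])))  # log odds ratio
}

metrics <- diag_metrics(counts)
round(metrics, 7)
```

The values agree with the diagnosticErrors() output listed above, which suggests that the package applies the same definitions to the averaged counts.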


21.5.2  Example  2:  Sleep  Dataset 

These  data  contain  the  effect  of  two  soporific  drugs  to  increase  hours  of  sleep 
(treatment-compared  design)  on  10  patients.  The  data  are  available  in  R  by  default 
(sleep  {datasets}). 

First, load the data and report some graphs and summaries (Fig. 21.3).

Fig. 21.3 Box-and-whisker plots of the hours of sleep for the two cohorts in the sleep dataset


data(sleep); str(sleep)
X = as.matrix(sleep[, 1, drop=FALSE])   # increase in hours of sleep,
# drop is logical, if TRUE the result is coerced to the lowest possible
# dimension.
# The default is to drop if only one column is left, but not to drop if only
# one row is left.
Y = sleep[, 2]   # drug given
plot(X ~ Y)

levels(Y)   # "1" "2"
dim(X)      # 20 1

Next,  we  will  define  a  new  LDA  (linear  discriminant  analysis)  predicting  function 
and  perform  the  cross-validation  (CV)  on  the  resulting  predictor. 

require("MASS")   # for the lda function
predfun.lda = function(train.x, train.y, test.x, test.y, negative)
{ lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  # count TP, FP etc.
  out = confusionMatrix(test.y, ynew, negative=negative)
  return( out )
}

# install.packages("crossval")
library("crossval")
set.seed(123456)
cv.out <- crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1",
    verbose=FALSE)
cv.out$stat
diagnosticErrors(cv.out$stat)




Execute the above code and interpret the diagnostic results measuring the
performance of the LDA prediction.
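For readers who want to see what the cross-validation wrapper does internally, the following is a minimal manual sketch of one 5-fold pass using only MASS and base R. It is an assumption-laden illustration (a single repetition, accuracy as the only metric), not a reimplementation of crossval::crossval.

```r
library(MASS)   # lda(); MASS ships with standard R installations

set.seed(123456)
K <- 5
# assign each of the 20 sleep records to one of K folds at random
folds <- sample(rep(seq_len(K), length.out = nrow(sleep)))
correct <- logical(nrow(sleep))

for (k in seq_len(K)) {
  test <- folds == k
  fit  <- lda(group ~ extra, data = sleep[!test, ])       # train on the other K-1 folds
  pred <- predict(fit, newdata = sleep[test, ])$class     # predict the held-out fold
  correct[test] <- pred == sleep$group[test]
}

mean(correct)   # cross-validated accuracy of the one-predictor LDA
```

Every record is used for testing exactly once, which is the property that distinguishes k-fold cross-validation from repeated random sub-sampling.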


21.5.3 Example 3: Model-Based (Linear Regression)
Prediction Using the Attitude Dataset

These data represent a survey of the clerical employees of an organization,
aggregated over approximately 35 employees in each of 30 (randomly selected)
departments. The data include the proportion of favorable responses to 7 questions
in each department.

Let's load and summarize the data, which is available in the R {datasets} package as
attitude.

# ?attitude, colnames(attitude)
# Note: when using a data frame, a time-saver is to use "." to indicate
# "include all covariates" in the DF.
# E.g., fit <- lm(Y ~ ., data = D)

data("attitude")
y = attitude[, 1]    # rating variable
x = attitude[, -1]   # data frame with the remaining variables
is.factor(y)
summary(lm(y ~ ., data=x))   # R-squared: 0.7326
# set up lm prediction function

We will demonstrate model-based analytics using lm and lda, and then will
validate the forecasting using CV.

predfun.lm = function(train.x, train.y, test.x, test.y)
{ lm.fit = lm(train.y ~ ., data=train.x)
  ynew = predict(lm.fit, test.x)
  # compute squared error risk (MSE)
  out = mean((ynew - test.y)^2)
  # note that, in general, when fitting a linear model to a continuous
  # outcome variable (Y), we can't use
  # out <- confusionMatrix(test.y, ynew, negative=negative),
  # as it requires a binary outcome
  # this is why we use the MSE as an estimate of the discrepancy
  # between observed & predicted values
  return(out)
}

# require("MASS")
# predfun.lda = function(train.x, train.y, test.x, test.y, negative)
# { lda.fit = lda(train.x, grouping=train.y)
#   ynew = predict(lda.fit, test.x)$class
#   # count TP, FP etc.
#   out = confusionMatrix(test.y, ynew, negative=negative)
#   return( out )
# }

# prediction MSE using all variables
set.seed(123456)
cv.out.lm = crossval::crossval(predfun.lm, x, y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)   # 72.581198 3.736784

# reducing to using only two variables
cv.out.lm = crossval::crossval(predfun.lm, x[, c(1, 3)], y, K=5, B=20,
    verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)
# 52.563957 2.015109
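The same comparison can be approximated without the crossval package. The helper cv_mse() below is hypothetical; it assumes B repetitions of K-fold cross-validation and summarizes the spread across repetitions as sd/sqrt(B), which may differ from the normalization the package uses. Since the random splits differ, the numbers will not match the values reported above exactly.

```r
# Hypothetical helper: B repetitions of K-fold cross-validation for a linear model
cv_mse <- function(formula, data, K = 5, B = 20) {
  n <- nrow(data)
  response <- all.vars(formula)[1]                 # name of the outcome variable
  reps <- replicate(B, {
    folds <- sample(rep(seq_len(K), length.out = n))
    fold_mse <- sapply(seq_len(K), function(k) {
      fit  <- lm(formula, data = data[folds != k, ])
      pred <- predict(fit, newdata = data[folds == k, ])
      mean((data[[response]][folds == k] - pred)^2)
    })
    mean(fold_mse)                                  # MSE averaged over the K folds
  })
  c(stat = mean(reps), stat.se = sd(reps) / sqrt(B))
}

set.seed(123456)
full    <- cv_mse(rating ~ ., attitude)                      # all six predictors
reduced <- cv_mse(rating ~ complaints + learning, attitude)  # columns 1 and 3 of x
rbind(full, reduced)
```

Comparing the two rows shows whether dropping predictors lowers the out-of-sample error, which is the point of the reduced two-variable fit above.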


21.5.4 Example 4: Parkinson's Data (ppmi_data)

Let’s  go  back  to  the  more  elaborate  PD  data  and  start  by  loading  and  preprocessing 
the  derived-PPMI  data. 

# ppmi_data <- read.csv("https://umich.instructure.com/files/330400/download?download_frd=1",
#     header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control",
#     "Control", "Patient")
# attach(ppmi_data); head(ppmi_data)
# install.packages("crossval")
# library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup",
#     "PD", "X", "FID_IID"))]
# output <- as.factor(ppmi_data$PD)

# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD",
    "X", "FID_IID", "VisitID"))]

X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)   # Actual PD clinical assessment
dim(X); dim(Y)

layout(matrix(c(1, 2, 3, 4), 2, 2))   # optional 4 graphs/page
fit <- lm(as.numeric(Y) ~ X)   # Y holds the labels "0"/"1" as text; convert to numeric for lm
plot(fit)   # plot the fit

levels(as.factor(Y))   # "0" "1"
## [1] "0" "1"

c(dim(X), dim(Y))   # rows and columns of X, followed by those of Y
## [1] 422 100 422   1

Apply cross-validation to assess the performance of the linear model
(Fig. 21.4).

Fig. 21.4 Residual plots provide exploratory analytics of the model quality


set.seed(12345)
# cv.out.lm = crossval::crossval(predfun.lm, as.data.frame(X), as.numeric(Y),
#     K=5, B=20)
cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1",
    verbose=FALSE)
# K=Number of folds; B=Number of repetitions.

# Results
# cv.out.lda$stat;
# cv.out.lda;
diagnosticErrors(cv.out.lda$stat)
##       acc      sens      spec       ppv       npv       lor
## 0.9617299 0.9513333 0.9872951 0.9945984 0.8918919 7.3258500

# cv.out.lm$stat;
# cv.out.lm;
# diagnosticErrors(cv.out.lm$stat)


21.6  Summary  of  CV  output 


The cross-validation (CV) output object includes the following three components:

• stat.cv: Vector of statistics returned by predfun for each cross-validation run.
• stat: Mean statistic returned by predfun averaged over all cross-validation runs.
• stat.se: Variability measuring the corresponding standard error.


21.7  Alternative  Predictor  Functions 

We have already seen a number of predict() functions, e.g., Chap. 18. Below,
we will add to the collection of predictive analytics and forecasting functions.

21.7.1 Logistic Regression

We already saw the logit model in Chap. 18. Now, we will demonstrate a
logit-predictor function by applying it to the PD dataset.

# ppmi_data <- read.csv("https://umich.instructure.com/files/330400/download?download_frd=1",
#     header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control",
#     "Control", "Patient")
# install.packages("crossval"); library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)

# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD",
    "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)

Note that the values returned by predict() are on the log-odds (logit) scale, so they
need to be exponentiated (yielding odds) to be correctly interpreted.
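The relation between the three scales can be checked with a few made-up linear-predictor values. This sketch uses only base R (plogis is the inverse logit); it illustrates the scales and makes no claim about which cutoff is preferable for the PD data.

```r
eta  <- c(-2, log(0.5), 0, 2)      # illustrative log-odds (linear predictor) values
odds <- exp(eta)                   # exponentiating gives odds
prob <- plogis(eta)                # inverse logit gives probabilities, odds / (1 + odds)

data.frame(eta, odds, prob)

# Cutting the odds at 0.5 is the same rule as cutting the probability at 1/3;
# cutting the probability at 0.5 is the same rule as cutting the log-odds at 0.
```

In other words, a rule of the form exp(ynew) < 0.5 assigns class 0 whenever the estimated probability is below one third.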

lm.logit <- glm(as.numeric(Y) ~ ., data = as.data.frame(X), family = "binomial")
ynew <- predict(lm.logit, as.data.frame(X));   # plot(ynew)
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1);   # plot(ynew2)

predfun.logit = function(train.x, train.y, test.x, test.y, neg)
{ lm.logit <- glm(train.y ~ ., data = train.x, family = "binomial")
  ynew = predict(lm.logit, test.x)
  # compute TP, FP, TN, FN
  ynew2 <- ifelse(exp(ynew)<0.5, 0, 1)
  out = confusionMatrix(test.y, ynew2, negative=neg)   # Binary outcome,
  # we can use confusionMatrix
  return( out )
}

# Reduce the bag of explanatory variables, purely to simplify the
# interpretation of the analytics in this example!
input.short <- input[ , which(names(input) %in% c("R_fusiform_gyrus_Volume",
    "R_fusiform_gyrus_ShapeIndex", "R_fusiform_gyrus_Curvedness",
    "Sex", "Weight", "Age", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline",
    "UPDRS_Part_I_Summary_Score_Month_03",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"
))]
X = as.matrix(input.short)

cv.out.logit = crossval::crossval(predfun.logit, as.data.frame(X),
    as.numeric(Y), K=5, B=2, neg="1", verbose=FALSE)
cv.out.logit$stat.cv
##       FP TP TN FN
## B1.F1  1 50 31  2
## B1.F2  0 60 19  6
## B1.F3  2 55 19  8
## B1.F4  3 58 23  0
## B1.F5  3 60 21  1
## B2.F1  2 56 22  4
## B2.F2  0 57 23  5
## B2.F3  3 60 20  1
## B2.F4  1 58 23  2
## B2.F5  1 54 27  3

diagnosticErrors(cv.out.logit$stat)
##       acc      sens      spec       ppv       npv       lor
## 0.9431280 0.9466667 0.9344262 0.9726027 0.8769231 5.5331424
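The averaged statistic can be recovered from the fold-level counts printed by cv.out.logit$stat.cv above. In the sketch below the counts are copied from that output; the standard error is computed as sd/sqrt(number of runs), which is an assumed convention and may not match the exact normalization used by the crossval package.

```r
# Fold-level counts copied from the stat.cv output above (2 repetitions x 5 folds)
stat_cv <- matrix(c(1, 50, 31, 2,
                    0, 60, 19, 6,
                    2, 55, 19, 8,
                    3, 58, 23, 0,
                    3, 60, 21, 1,
                    2, 56, 22, 4,
                    0, 57, 23, 5,
                    3, 60, 20, 1,
                    1, 58, 23, 2,
                    1, 54, 27, 3),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(paste0("B", rep(1:2, each = 5), ".F", 1:5),
                                  c("FP", "TP", "TN", "FN")))

stat <- colMeans(stat_cv)                              # mean count per run
se   <- apply(stat_cv, 2, sd) / sqrt(nrow(stat_cv))    # assumed sd/sqrt(n) convention
acc  <- unname((stat["TP"] + stat["TN"]) / sum(stat))  # accuracy from the mean counts

stat; se; acc
```

The accuracy obtained from the mean counts reproduces the acc value reported by diagnosticErrors() above.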


Caution: Note that if you forget to exponentiate the values of the predicted logistic
model (see ynew2 in predfun.logit), you will get nonsense results, e.g., all
cases may be predicted to be in one class, trivial sensitivity, or incorrect NPV.


21.7.2 Quadratic Discriminant Analysis (QDA)

In Chaps. 8 and 21, we discussed the linear and quadratic discriminant analysis
models. Let's now introduce a predfun.qda() function.

predfun.qda = function(train.x, train.y, test.x, test.y, negative)
{
  require("MASS")   # for the qda function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=negative)
  return( out.qda )
}

cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(input.short),
    as.factor(Y), K=5, B=20, neg="1")
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);
## Error in diagnosticErrors(cv.out.qda$stat): object 'cv.out.qda' not found

This error message, "Error in qda.default(x, grouping, ...): rank deficiency in
group 1", indicates that there is a rank deficiency, i.e., some variables are collinear and
one or more covariance matrices cannot be inverted to obtain the estimates in group
1 (Controls).

If you remove the strongly correlated data elements
("R_fusiform_gyrus_Volume", "R_fusiform_gyrus_ShapeIndex", and
"R_fusiform_gyrus_Curvedness"), the rank-deficiency problem goes away.
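Rank deficiency is easy to reproduce on synthetic data. The sketch below uses made-up variables (a, b, and their sum) rather than the PPMI features, and checks the rank of the sample covariance matrix with base-R qr().

```r
set.seed(1)
a <- rnorm(30); b <- rnorm(30)
X_bad  <- cbind(a = a, b = b, c = a + b)   # c is an exact linear combination of a and b
X_good <- X_bad[, c("a", "b")]             # drop the redundant column

qr(cov(X_bad))$rank    # rank 2 with 3 columns: the covariance matrix is singular
qr(cov(X_good))$rank   # rank 2 with 2 columns: full rank, so it can be inverted
```

A singular class covariance matrix cannot be inverted, which is why QDA stops with a rank-deficiency error until the redundant features are removed.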
input.short2 <- input[ , which(names(input) %in% c("R_fusiform_gyrus_Volume",
    "Sex", "Weight", "Age", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"
))]
X = as.matrix(input.short2)

cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(X), as.numeric(Y),
    K=5, B=2, neg="1")

It makes sense to contrast the QDA and GLM/Logit predictions.

diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat)
##       acc      sens      spec       ppv       npv       lor
## 0.9407583 0.9533333 0.9098361 0.9629630 0.8880000 5.3285694
##       acc      sens      spec       ppv       npv       lor
## 0.9431280 0.9466667 0.9344262 0.9726027 0.8769231 5.5331424

Clearly, the QDA and Logit model predictions are quite similar and reliable.


21.7.3 Foundation of LDA and QDA for Prediction,
Dimensionality Reduction, and Forecasting


Previously, in Chap. 8 we saw some examples of LDA/QDA methods. Now, we'll
provide more details. Both LDA (Linear Discriminant Analysis) and
QDA (Quadratic Discriminant Analysis) use probabilistic models of the
class-conditional distribution of the data P(X | Y = k) for each class k. Their
predictions are obtained by using Bayes' theorem
(http://wiki.socr.umich.edu/index.php/SMHS_BayesianInference#Bayesian_Rule):

$$P(Y = k \mid X) = \frac{P(X \mid Y = k)\, P(Y = k)}{P(X)} = \frac{P(X \mid Y = k)\, P(Y = k)}{\sum_{l} P(X \mid Y = l)\, P(Y = l)}.$$

Thus, we select the class k that maximizes this posterior probability (maximum
a posteriori classification). In linear and quadratic discriminant analysis,
P(X | Y = k) is modelled as a multivariate Gaussian distribution with density:

$$P(X \mid Y = k) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_k|^{1/2}} \times e^{\left(-\frac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right)},$$

where n is the number of features.


This model can be used to classify data by using the training data to estimate:

(1) The class prior probabilities P(Y = k) by counting the proportion of observed
instances of class k,
(2) The class means $\mu_k$ by computing the empirical sample class means, and
(3) The covariance matrices by computing either the empirical sample class
covariance matrices, or by using a regularized estimator (e.g., LASSO).

In the linear case (LDA), the Gaussians for each class are assumed to share the
same covariance matrix: $\Sigma_k = \Sigma$ for each class k. This leads to linear decision
surfaces separating different classes. This is clear from comparing the
log-probability ratios of a pair of classes (k and l):

$$LOR = \log\left(\frac{P(Y = k \mid X)}{P(Y = l \mid X)}\right)$$

(LOR = 0 $\Longleftrightarrow$ the two class probabilities are identical). With equal class priors,

$$LOR = 0 \Longleftrightarrow (\mu_k - \mu_l)^T \Sigma^{-1} X = \frac{1}{2}\left(\mu_k^T \Sigma^{-1} \mu_k - \mu_l^T \Sigma^{-1} \mu_l\right),$$

which is linear in X.

But, in the more general, quadratic case of QDA, there are no assumptions on the
covariance matrices $\Sigma_k$ of the Gaussians, leading to more flexible quadratic decision
surfaces separating the classes.
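The estimation steps and the linear decision rule can be sketched from scratch and compared against MASS::lda. This is an illustration under stated assumptions: it uses the built-in iris data (not the PD data), the pooled covariance with divisor n - K, and class proportions as priors; the variable names are chosen here for illustration.

```r
library(MASS)   # only used for the comparison at the end

X <- as.matrix(iris[, 1:4]); y <- iris$Species
classes <- levels(y)
n <- nrow(X); K <- length(classes)

prior <- table(y) / n                                            # (1) class priors P(Y = k)
mu    <- t(sapply(classes, function(k) colMeans(X[y == k, ])))   # (2) class means, K x p
# (3) pooled within-class covariance: the shared Sigma assumed by LDA
Sigma <- Reduce(`+`, lapply(classes, function(k) {
  Xc <- scale(X[y == k, ], center = TRUE, scale = FALSE)
  crossprod(Xc)
})) / (n - K)
Sigma_inv <- solve(Sigma)

# linear discriminant score of each case for each class:
# delta_k(x) = x' Sigma^-1 mu_k - 0.5 * mu_k' Sigma^-1 mu_k + log P(Y = k)
const  <- 0.5 * diag(mu %*% Sigma_inv %*% t(mu)) - log(as.numeric(prior))
scores <- X %*% Sigma_inv %*% t(mu) - matrix(const, nrow = n, ncol = K, byrow = TRUE)
manual <- factor(classes[max.col(scores, ties.method = "first")], levels = classes)

agreement <- mean(manual == predict(lda(X, grouping = y), X)$class)
agreement   # proportion of cases where the manual rule and MASS::lda agree
```

Each score is linear in x, so the boundary between any two classes (where their scores are equal) is a hyperplane, in line with the LOR = 0 condition above.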


LDA  (Linear  Discriminant  Analysis) 

LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts
to express one dependent variable as a linear combination of the other features or
data elements. However, ANOVA uses categorical independent variables and a
continuous dependent variable, whereas LDA has continuous independent variables
and a categorical dependent variable (i.e., Dx/class label). Logistic regression and
probit regression are more similar to LDA than ANOVA, as they also explain a
categorical variable by the values of continuous independent variables.


predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  out.lda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.lda )
}


QDA (Quadratic Discriminant Analysis)

Similarly to LDA, the QDA prediction function can be defined by:

predfun.qda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")   # for the qda function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.qda )
}
21.7.4 Neural Networks

We already saw Artificial Neural Networks (NNs) in Chap. 11. Applying NNs is not
straightforward. We have to create a design matrix with an indicator column for the
response feature. In addition, we need to write a predict function to translate the
output of neuralnet() into analytical forecasts.
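Both ingredients can be illustrated without fitting a network. The sketch below uses made-up labels and made-up output scores: model.matrix builds one indicator column per class, and which.max translates per-class scores back into 0/1 labels.

```r
y <- factor(c(0, 1, 1, 0, 1))                 # illustrative binary labels
ind <- model.matrix(~ y - 1)                  # one indicator column per class
colnames(ind) <- paste0("out", levels(y))
ind

# Made-up network outputs: one score per class (columns) for each case (rows)
net_out <- matrix(c(0.9, 0.1,
                    0.2, 0.8,
                    0.4, 0.6,
                    0.7, 0.3,
                    0.1, 0.9), ncol = 2, byrow = TRUE)
decoded <- apply(net_out, 1, which.max) - 1   # column index 1/2 becomes label 0/1
decoded
```

Subtracting 1 after which.max is what maps the winning output column back to the original 0/1 coding of the response.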

# predict nn
library("neuralnet")
pred = function(nn, dat) {
  yhat = compute(nn, dat)$net.result
  yhat = apply(yhat, 1, which.max)-1   # winning output column -> label 0/1
  return(yhat)
}

my.neural <- function (train.x, train.y, test.x, test.y, method,
    layer=c(5, 5)){
  train.x <- as.data.frame(train.x)
  train.y <- as.data.frame(train.y)
  colnames(train.x) <- paste0('V', 1:ncol(train.x))
  colnames(train.y) <- "V1"
  # indicator (one-hot) columns for the response
  train_y_ind = model.matrix(~factor(train.y$V1)-1)
  colnames(train_y_ind) = paste0('out', 0:1)
  train = cbind(train.x, train_y_ind)
  y_names = paste0('out', 0:1)
  x_names = paste0('V', 1:ncol(train.x))
  nn = neuralnet(
    paste(paste(y_names, collapse='+'),
      '~',
      paste(x_names, collapse='+')),
    train,
    hidden=layer,
    linear.output=FALSE,
    lifesign='full', lifesign.step=1000)
  # predict
  predict.y <- pred(nn, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

set.seed(1234)
cv.out.nn <- crossval::crossval(my.neural, scale(X), Y, K = 5, B = 1,
    layer=c(20, 20), verbose = F)   # scaling the predictors is necessary.


## hidden: 20, 20   thresh: 0.01   rep: 1/1   steps: 63   error: 1.02185   time: 0.08 secs
## hidden: 20, 20   thresh: 0.01   rep: 1/1   steps: 79   error: 1.01374   time: 0.2 secs
## hidden: 20, 20   thresh: 0.01   rep: 1/1   steps: 73   error: 1.02399   time: 0.09 secs
## hidden: 20, 20   thresh: 0.01   rep: 1/1   steps: 66   error: 1.03016   time: 0.09 secs
## hidden: 20, 20   thresh: 0.01   rep: 1/1   steps: 72   error: 1.01491   time: 0.11 secs

crossval::diagnosticErrors(cv.out.nn$stat)
##          acc         sens         spec          ppv          npv          lor
## 0.9454976303 0.9016393443 0.9633333333 0.9090909091 0.9601328904 5.4841051313

Again  the  forecasting  results  on  the  PD  dataset  are  quite  good. 

21.7.5  SVM 


In Chap. 11, we also saw SVM classification. Let's try cross-validation using linear
and Gaussian (radial) kernel SVM. We may expect that linear SVM would achieve
a result similar to, or even better than, Gaussian SVM, since this dataset has
a large k (# features) compared with n (# cases), which we explored in detail in
Chap. 11.

library("e1071")
my.svm <- function (train.x, train.y, test.x, test.y, method, cost=1,
    gamma=1/ncol(train.x), coef0=0, degree=3){
  svm_l.fit <- svm(x = train.x, y=as.factor(train.y), kernel = method)
  predict.y <- predict(svm_l.fit, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

# Linear kernel
set.seed(123)
cv.out.svml <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,
    method = "linear", cost=tune_svm$best.parameters$cost, verbose = F)
diagnosticErrors(cv.out.svml$stat)
##          acc         sens         spec          ppv          npv          lor
## 0.9502369668 0.9098360656 0.9666666667 0.9173553719 0.9634551495 5.6789307585

# Gaussian kernel
set.seed(123)
cv.out.svmg <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,
    method = "radial", cost=tune_svmg$best.parameters$cost,
    gamma=tune_svmg$best.parameters$gamma, verbose = F)
diagnosticErrors(cv.out.svmg$stat)
##          acc         sens         spec          ppv          npv          lor
## 0.9454976303 0.9262295082 0.9533333333 0.8897637795 0.9694915254 5.5470977226

Indeed, both types of kernels yield good quality predictors according to the
assessment metrics reported by the diagnosticErrors() method.

21.7.6 k-Nearest Neighbors Algorithm (k-NN)

As we saw in Chap. 7, k-NN is a non-parametric method for either classification or
regression, where the input consists of the k closest training examples in the feature
space, but the output depends on whether k-NN is used for classification or
regression:

• In k-NN classification, the output is a class membership (labels). Objects in the
testing data are classified by a majority vote of their neighbors. Each object is
assigned to a class that is most common among its k nearest neighbors (k is
typically a small positive integer). When k = 1, then an object is assigned to the
class of its single nearest neighbor.

• In k-NN regression, the output is the property value for the object representing the
average of the values of its k nearest neighbors.
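Both variants can be sketched with a brute-force search in base R. The helper knn_sketch() is hypothetical (it is not the class::knn implementation), assumes Euclidean distance, and is applied to a small made-up training set.

```r
# Hypothetical helper: brute-force k-NN prediction for a single query point
knn_sketch <- function(train_x, train_y, query, k = 3) {
  d  <- sqrt(rowSums(sweep(train_x, 2, query)^2))   # Euclidean distances to the query
  nn <- order(d)[seq_len(k)]                         # indices of the k nearest cases
  if (is.factor(train_y)) {
    names(which.max(table(train_y[nn])))             # classification: majority vote
  } else {
    mean(train_y[nn])                                # regression: neighbour average
  }
}

# made-up training data: two well-separated groups of three points
train_x <- matrix(c(0, 0,  0, 1,  1, 0,  5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
labels  <- factor(c("A", "A", "A", "B", "B", "B"))
values  <- c(1, 2, 3, 10, 11, 12)

knn_sketch(train_x, labels, query = c(0.2, 0.3), k = 3)   # class vote
knn_sketch(train_x, values, query = c(5.5, 5.5), k = 3)   # averaged value
```

The only difference between the two modes is the final aggregation step: a vote over neighbour labels versus a mean of neighbour values.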

Let's now build the corresponding predfun.knn() method.


# X = as.matrix(input)    # Predictor variables X = as.matrix(input.short2)
# Y = as.matrix(output)   # Outcome

# KNN (k-nearest neighbors)
library("class")
# knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F);
#     predict(as.matrix(knn.fit.test), X)$class
# table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
# This can be used for polytomous variable (multiple classes)

predfun.knn = function(train.x, train.y, test.x, test.y, neg)
{
  require("class")
  knn.fit = knn(train.x, test.x, cl = train.y, prob=T)   # knn is already
  # a prediction function!!!
  # ynew = predict(knn.fit, test.x)$class   # no need of another
  # prediction, in this case
  out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
  return( out.knn )
}

cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")

# Compare all 4 classifiers (lda, qda, knn, and logit)
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);
diagnosticErrors(cv.out.knn$stat); diagnosticErrors(cv.out.logit$stat);

We can also examine the performance of k-NN prediction on the PPMI
(Parkinson's disease) data. Start by partitioning the data into training and
testing sets.

# TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(input))
## set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]

# TESTING DATA
input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]

Then,  we  can  fit  the  k-NN  model  and  report  the  results. 

library("class")
knn_model <- knn(train=input.train, input.test, cl=as.factor(output.train),
    k=2)
# plot(knn_model)
summary(knn_model)
attributes(knn_model)

# cross-validation
knn_model.cv <- knn.cv(train=input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)


21.7.7  k-Means  Clustering  (k-MC) 

In  Chap.  13,  we  showed  that  k-MC  aims  to  partition  n  observations  into 
k  clusters,  where  each  observation  belongs  to  the  cluster  with  the  nearest  mean, 
which  acts  as  a  prototype  of  a  cluster.  The  k-MC  partitions  the  data  space  into 
Voronoi  cells.  In  general,  there  is  no  computationally  tractable  solution  for  this,  i.e., 
the  problem  is  NP-hard.  However,  there  are  efficient  algorithms  that  converge 
quickly  to  local  optima,  e.g.,  the  expectation-maximization  algorithm  for  mixtures 
of  Gaussian  distributions  via  an  iterative  refinement  approach  (Figs.  21.5,  21.6 
and  21.7). 
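The iterative refinement idea can be illustrated with a minimal sketch of Lloyd's algorithm on synthetic data. The toy data and the helper name lloyd_kmeans are illustrative assumptions, not part of the PPMI analysis; in practice the built-in kmeans() function should be preferred.

```r
# Minimal sketch of Lloyd's iterative refinement (hypothetical helper, toy data)
set.seed(1234)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

lloyd_kmeans <- function(x, k, iter.max = 50) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centers
  for (it in seq_len(iter.max)) {
    # assignment step: squared distance from every point to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cl <- max.col(-d2, ties.method = "first")         # index of the nearest center
    # update step: each center becomes the mean of its assigned points
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cl == j)) new_centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
    if (all(abs(new_centers - centers) < 1e-10)) break  # converged to a local optimum
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}

fit <- lloyd_kmeans(toy, 2)
table(fit$cluster)
```

Each pass alternates between assigning points to the nearest mean and recomputing the means, which never increases the within-cluster sum of squares; the result is a local, not necessarily global, optimum.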


kmeans_model <- kmeans(input.train, 2)
layout(matrix(1, 1))
# tiff("C:/Users/User/Desktop/test.tiff", width = 10, height = 10, units = 'in', res = 300)
fpc::plotcluster(input.train, output.train, col = kmeans_model$cluster)

cluster::clusplot(input.train, kmeans_model$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Fig. 21.5 k-Means clustering plot of the Parkinson’s disease data (PPMI)

Fig. 21.6 Clusterplot of the k-means clustering of the PPMI data (CLUSPLOT of input.train; the first two components explain 45.57% of the point variability)


par(mfrow=c(10,10))
# the next figure is very large and will not render in RStudio, you may need to save it as PDF file!
# pdf("C:/Users/User/Desktop/test.pdf", width = 50, height = 50)
# with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# dev.off()
with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))

Fig. 21.7 Pair plots of the two clustering labels along the first 10 PPMI features


# plot(input.train, col = kmeans_model$cluster)
# points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)

## cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model); head(fitted.kmeans)

##   L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## 2                      0.1071299082                  2635.580514
## 2                      0.1071299082                  2635.580514
## 1                      0.2221893533                  1134.578902
## 2                      0.1071299082                  2635.580514
## 2                      0.1071299082                  2635.580514
## 2                      0.1071299082                  2635.580514
##   L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## 2             7969.485443                0.3250065829
## 2             7969.485443                0.3250065829
## 1             2111.385018                0.2788562513
## 2             7969.485443                0.3250065829
## 2             7969.485443                0.3250065829
## 2             7969.485443                0.3250065829

resid.kmeans <- (input.train - fitted(kmeans_model))

# define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)




## Equalities
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")],  # the same two columns
      c(ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))

##              [,1]        [,2]
## betweenss    15462062254 15462062254
## tot.withinss 12249286905 12249286905
## totss        27711349159 27711349159

# validation
stopifnot(all.equal(kmeans_model$totss, ss(input.train)),
          all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
          ## these three are the same:
          all.equal(kmeans_model$betweenss, ss(fitted.kmeans)),
          all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
          ## and hence also
          all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)

# kmeans(input.train, 1)$withinss
# trivial one-cluster, (its W.SS == ss(input.train))
clust_kmeans2 = kmeans(scale(X), centers=X[1:2,], iter.max=100, algorithm='Lloyd')

With poorly chosen initial centers (e.g., two arbitrarily or randomly selected points), we may get empty clusters instead of two proper clusters. One way to mitigate this problem is to choose the initial centers using k-means++.

# k++ initialize
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    # keep the row minimum, i.e., the squared distance to the closest center
    # (a row mean would not match the comment above)
    dists = apply(dists, 1, min)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}

clust_kmeans2_plus = kmeans(scale(X), kpp_init(scale(X), 2), iter.max=100, algorithm='Lloyd')
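The seeding step can also be written more compactly with the prob argument of sample(). The following is only a sketch under the assumption that squared Euclidean distances are used; the helper name kpp_seed and the synthetic data are hypothetical and are not part of the original analysis.

```r
# Sketch of k-means++ seeding via D^2-weighted sampling (hypothetical helper name)
kpp_seed <- function(x, K) {
  x <- as.matrix(x)
  idx <- sample(nrow(x), 1)                  # first center: uniform at random
  d2 <- rowSums(sweep(x, 2, x[idx, ])^2)     # squared distance to the nearest chosen center
  while (length(idx) < K) {
    nxt <- sample(nrow(x), 1, prob = d2)     # P(next = i) proportional to d2[i]
    idx <- c(idx, nxt)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[nxt, ])^2))
  }
  x[idx, , drop = FALSE]
}

set.seed(1234)
toy <- scale(rbind(matrix(rnorm(100, 0), ncol = 2),
                   matrix(rnorm(100, 5), ncol = 2)))
fit <- kmeans(toy, centers = kpp_seed(toy, 2), iter.max = 100, algorithm = "Lloyd")
fit$size
```

Because an already chosen point has zero distance to itself, it has zero probability of being drawn again, so the initial centers are distinct and each starts with at least one member.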

Now let’s evaluate the model. The first step is to justify the selection of k=2. We use the method silhouette() in package cluster. Recall from Chap. 14 that the silhouette value is between -1 and 1. Negative silhouette values represent “mis-clustered” cases (Fig. 21.8).

Fig. 21.8 Silhouette plot of the 2-class k-means clustering of the Parkinson’s disease data (n = 100, 2 clusters; cluster 1: 48 cases, average silhouette width 0.19; cluster 2: 52 cases, average silhouette width 0.10)


clust_k2 = clust_kmeans2_plus$cluster
require(cluster)

## Loading required package: cluster

# X = as.matrix(input.short2)
# as the data is too large for the silhouette plot, we'll just subsample and plot 100 random cases
subset_int <- sample(nrow(X), 100)  #100 cases from 661 total cases
dis = dist(as.data.frame(scale(X[subset_int, ])))
sil_k2 = silhouette(clust_k2[subset_int], dis)  #best
plot(sil_k2)

summary(sil_k2)

## Silhouette of 100 units in 2 clusters from silhouette.default(x = clust_k2[subset_int], dist = dis) :
##  Cluster sizes and average silhouette widths:
##           48           52
## 0.1895633766 0.1018642857
## Individual silhouette widths:
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## -0.06886907  0.06533312  0.14169240  0.14395980  0.22658680  0.33585520

# note: sil_k2 is a 3-column matrix (cluster, neighbor, sil_width), so this mean is taken
# over all 300 entries; mean(sil_k2[, "sil_width"] < 0) gives the fraction of cases
mean(sil_k2<0)

## [1] 0.01666666667

The result is reasonable. Only a small number of samples are “mis-clustered” (having negative silhouette values). Furthermore, when k=3 or k=4, the mean silhouette widths reported below (about 0.15 and 0.14) do not improve meaningfully on the k=2 value (about 0.14, from the 100-case subsample), and the smallest cluster for k=4 has an average width near zero, so a larger k does not yield a clearly better clustering.
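For reference, the silhouette width of case i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. A minimal base-R sketch on synthetic data (the toy data and variable names are illustrative assumptions) computes these widths directly:

```r
# Silhouette widths computed from first principles on toy data
set.seed(1234)
toy <- rbind(matrix(rnorm(40, 0), ncol = 2),
             matrix(rnorm(40, 4), ncol = 2))
cl <- kmeans(toy, 2)$cluster
D  <- as.matrix(dist(toy))

sil_width <- sapply(seq_len(nrow(toy)), function(i) {
  own <- cl == cl[i]
  a <- sum(D[i, own]) / (sum(own) - 1)           # mean distance within own cluster (excluding i)
  b <- min(tapply(D[i, !own], cl[!own], mean))   # smallest mean distance to another cluster
  (b - a) / max(a, b)
})

summary(sil_width)
mean(sil_width < 0)    # fraction of "mis-clustered" cases
```

Here the fraction of negative widths is taken over the cases only, which is the quantity usually interpreted as the proportion of mis-clustered observations.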

dis = dist(as.data.frame(scale(X)))
clust_kmeans3_plus = kmeans(scale(X), kpp_init(scale(X), 3), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans3_plus$cluster, dis))

## Silhouette of 422 units in 3 clusters from silhouette.default(x = clust_kmeans3_plus$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##           139           157           126
## 0.08356111542 0.19458813829 0.17237138090
## Individual silhouette widths:
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## -0.06355399  0.08376430  0.16639550  0.15138420  0.21855670  0.33107050

clust_kmeans4_plus = kmeans(scale(X), kpp_init(scale(X), 4), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans4_plus$cluster, dis))

## Silhouette of 422 units in 4 clusters from silhouette.default(x = clust_kmeans4_plus$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##            138            121            111             52
## 0.124165755516 0.170228092125 0.193359499726 0.008929262925
## Individual silhouette widths:
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## -0.16300240  0.08751445  0.15091580  0.14137370  0.21035560  0.32293680

Then,  let’s  calculate  the  unsupervised  classification  error.  Here,  p  represents  the 
percentage  of  class  0  cases,  which  provides  the  weighting  factor  for  labelling  each 
cluster. 

mat = matrix(1, nrow = length(Y))
p = sum(Y==0)/length(Y)
for (i in 1:2){
  id = which(clust_k2==i)
  if(sum(Y[id]==0)>length(id)*p){
    mat[id] = 0
  }
}

# recent versions of caret require factor inputs
caret::confusionMatrix(as.factor(Y), as.factor(mat))

## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 195 105
##          1   1 121
##
##                Accuracy : 0.7488152
##                  95% CI : (0.7045933, 0.7895087)
##     No Information Rate : 0.535545
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##
##                   Kappa : 0.5122558
##  Mcnemar's Test P-Value : < 0.00000000000000022204
##
##             Sensitivity : 0.9948980
##             Specificity : 0.5353982
##          Pos Pred Value : 0.6500000
##          Neg Pred Value : 0.9918033
##              Prevalence : 0.4644550
##          Detection Rate : 0.4620853
##    Detection Prevalence : 0.7109005
##       Balanced Accuracy : 0.7651481
##
##        'Positive' Class : 0

Fig. 21.9 Multi-dimensional scaling plot (2D projection) of the k-means clustering depicting the agreement between testing data labels (glyph shapes) and the predicted class labels (glyph colors)

It achieves about 75% accuracy (0.749 in the output above), which is reasonable for unsupervised classification. Finally, let’s visualize the results by superimposing the data into the first two multi-dimensional scaling (MDS) dimensions (Fig. 21.9).

library("ggplot2")
mds = as.data.frame(cmdscale(dis, k=2))
mds_temp = cbind(mds, as.factor(clust_k2))
names(mds_temp) = c('V1', 'V2', 'cluster k=2')

gp_cluster = ggplot(mds_temp, aes(x=V2, y=V1, color=as.factor(clust_k2))) +
  geom_point(aes(shape = as.factor(Y))) + theme()
gp_cluster

21.7.8 Spectral Clustering

Suppose the multivariate dataset is represented as a set of data points $A$. We can define a similarity matrix $S = \{s_{i,j}\}$, where $s_{i,j}$ represents a measure of the similarity between points $i, j \in A$. Spectral clustering uses the spectrum of the similarity matrix of the high-dimensional data and performs dimensionality reduction for clustering into fewer dimensions. The spectrum of a matrix is the set of its eigenvalues. In general, if a linear operator $T: \Omega \to \Omega$ maps a vector space $\Omega$ into itself, its spectrum is the set of scalars $\Lambda = \{\lambda_i\}$ such that $(T - \lambda_i I)v = 0$, where $I$ is the identity matrix and $v$ are the (nonzero) eigenvectors (or eigenfunctions) of the operator $T$. The determinant of the matrix equals the product of its eigenvalues, i.e., $\det(T) = \prod_i \lambda_i$, the trace of the matrix is $tr(T) = \sum_i \lambda_i$, and the pseudo-determinant of a singular matrix is the product of its nonzero eigenvalues, $pseudodet(T) = \prod_{\lambda_i \ne 0} \lambda_i$.

To partition the data into two sets $(S_1, S_2)$, denote by $v$ the eigenvector corresponding to the second-smallest eigenvalue of the Laplacian matrix

$$L = I - D^{-1/2} S D^{-1/2}$$

of the similarity matrix $S$, where $D$ is the diagonal matrix with entries $D_{i,i} = \sum_j s_{i,j}$.

The actual $(S_1, S_2)$ partitioning of the cases in the data may be done in different ways. For instance, $S_1$ may be formed by computing the median $m$ of the components of $v$ and grouping all data points whose component in $v$ is greater than $m$. Then, the remaining cases can be labeled as part of $S_2$. This approach may be used iteratively for hierarchical clustering by repeatedly partitioning the subsets.
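The two-way partition described above can be sketched in base R. This is only an illustration under stated assumptions: a Gaussian similarity with an arbitrarily chosen width sigma = 1, the symmetric normalized Laplacian, and synthetic two-blob data; it is not the algorithm implemented in any particular package.

```r
# Sketch: spectral bipartition using the median of the second-smallest eigenvector
set.seed(1234)
toy <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
             matrix(rnorm(60, 3, 0.5), ncol = 2))

sigma <- 1                                          # assumed kernel width
S <- exp(-as.matrix(dist(toy))^2 / (2 * sigma^2))   # Gaussian similarity matrix
d <- rowSums(S)                                     # diagonal entries of D
L <- diag(nrow(S)) - diag(1 / sqrt(d)) %*% S %*% diag(1 / sqrt(d))

eig <- eigen(L, symmetric = TRUE)                   # eigenvalues are returned in decreasing order
v <- eig$vectors[, nrow(S) - 1]                     # eigenvector of the second-smallest eigenvalue
part <- ifelse(v > median(v), 1, 2)                 # S_1: components above the median
table(part)
```

Splitting at the median forces two groups of (nearly) equal size; splitting at zero or at the largest gap in the sorted components of v are common alternatives when the groups are unbalanced.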

The  specc  method  in  the  kernlab  package  implements  a  spectral  clustering 
algorithm  where  the  data-clustering  is  performed  by  embedding  the  data  into  the 
subspace  of  the  eigenvectors  of  an  affinity  matrix. 

# install.packages("kernlab")
library("kernlab")

# review and choose a dataset (for example the Iris data)
data()

#plot(iris)

Let’s look at a few simple cases of spectral clustering. We are suppressing some of the outputs to save space (e.g., #plot(my_data, col= data_sc)).


Iris  Petal  Data 

Let’s  look  at  the  iris  dataset  we  saw  in  Chap.  3. 

data(iris)
# assumption: cluster the four numeric measurements; the Species factor is dropped here
my_data <- as.matrix(iris[, -5])
num_clusters <- 3

data_sc <- specc(my_data, centers= num_clusters)
data_sc

centers(data_sc)
withinss(data_sc)

#plot(my_data, col= data_sc)

Spirals Data

Another simple dataset is kernlab::spirals (Fig. 21.10).

Fig. 21.10 Spirals data spectral clustering results


# library("kernlab")
data(spirals)
num_clusters <- 2

data_sc <- specc(spirals, centers= num_clusters)
data_sc

## Spectral Clustering object of class "specc"
##
##  Cluster memberships:
##
## 1 1 2 2 1 2 2 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 2 1 2 2 1 2 1
## 2 1 1 2 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 2
## 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1
## 1 1 1 2 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 1
## 2 2 2 2 2 2 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 2 2 1 1 1 2 2 2 1 1 2
## 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 2 2 1 1
## 1 2 1 1 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2
## 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 1 2 2 1 2 1 2 2 2 1 2 1 2 2 1 1 2 2 2 1
##
## Gaussian Radial Basis kernel function.
##  Hyperparameter : sigma = 367.501471756465
##
## Centers:
##                [,1]          [,2]
## [1,]  0.01997200551 -0.1761483316
## [2,] -0.01770984369  0.1775136857
##
## Cluster size:
## [1] 150 150
##
## Within-cluster sum of squares:
## [1] 117.3429096 118.1182272

centers(data_sc)

##                [,1]          [,2]
## [1,]  0.01997200551 -0.1761483316
## [2,] -0.01770984369  0.1775136857

withinss(data_sc)

## [1] 117.3429096 118.1182272

plot(spirals, col= data_sc)


Income  Data 

A customer income dataset representing a marketing survey is included in kernlab::income (Fig. 21.11).

Fig. 21.11 Pair plots of the two-class spectral clustering of the income dataset

data(income)
num_clusters <- 2

data_sc <- specc(income, centers= num_clusters)
data_sc

## Spectral Clustering object of class "specc"
##
##  Cluster memberships:
##
## 2 1 2 2 2 1 2 1 2 1 1 1 2 1
##
## String kernel function.  Type = spectrum
##  Hyperparameters : sub-sequence/string length = 4
##  Normalized
##
## Cluster size:
## [1] 7 7

centers(data_sc)

##      [,1]
## [1,]   NA

withinss(data_sc)

## logical(0)

plot(income, col= data_sc)


21.8  Compare  the  Results 


Now  let’s  compare  all  eight  classifiers  (AdaBoost ,  LDA,  QDA,  knn,  logit, 
Neural  Network,  linear  SVM  and  Gaussian  SVM)  we  presented  above 
(Table  21.1). 

#  get  AdaBoost  CV  results 
set.seed(123) 

cv.out.ada <- crossval::crossval(my.ada, as.data.frame(X), Y, K = 5, B = 1, negative = neg)

##  Number  of  folds:  5 

##  Total  number  of  CV  fits:  5 

## 

##  Round  #  1  of  1 
##  CV  Fit  #  1  of  5 
##  CV  Fit  #  2  of  5 
##  CV  Fit  #  3  of  5 
##  CV  Fit  #  4  of  5 
##  CV  Fit  #  5  of  5 

#  get  k-Means  CV  results 

my.kmeans <- function (train.x, train.y, test.x, test.y, negative, formula){
  kmeans.fit <- kmeans(scale(test.x), kpp_init(scale(test.x), 2),
                       iter.max=100, algorithm='Lloyd')
  predict.y <- kmeans.fit$cluster
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control").
  return (out)
}

set.seed(123)
cv.out.kmeans <- crossval::crossval(my.kmeans, as.data.frame(X), Y, K = 5, B = 2, negative = neg)

##  Number  of  folds:  5 

##  Total  number  of  CV  fits:  10 

## 

##  Round  #  1  of  2 
##  CV  Fit  #  1  of  10 
##  CV  Fit  #  2  of  10 
##  CV  Fit  #  3  of  10 
##  CV  Fit  #  4  of  10 
##  CV  Fit  #  5  of  10 
## 

##  Round  #  2  of  2 
##  CV  Fit  #  6  of  10 
##  CV  Fit  #  7  of  10 
##  CV  Fit  #  8  of  10 
##  CV  Fit  #  9  of  10 
##  CV  Fit  #  10  of  10 

#  get  spectral  clustering  CV  results 

my.sc <- function (train.x, train.y, test.x, test.y, negative, formula){
  sc.fit <- specc(scale(test.x), centers= 2)
  predict.y <- sc.fit@.Data
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control").
  return (out)
}

set.seed(123)
cv.out.sc <- crossval::crossval(my.sc, as.data.frame(X), Y, K = 5, B = 2, negative = neg)

##  Number  of  folds:  5 

##  Total  number  of  CV  fits:  10 

## 

##  Round  #  1  of  2 
##  CV  Fit  #  1  of  10 
##  CV  Fit  #  2  of  10 
##  CV  Fit  #  3  of  10 
##  CV  Fit  #  4  of  10 
##  CV  Fit  #  5  of  10 
## 

##  Round  #  2  of  2 
##  CV  Fit  #  6  of  10 
##  CV  Fit  #  7  of  10 
##  CV  Fit  #  8  of  10 
##  CV  Fit  #  9  of  10 
## CV Fit # 10 of 10

require(knitr)

## Loading required package: knitr

res_tab=rbind(diagnosticErrors(cv.out.ada$stat), diagnosticErrors(cv.out.lda$stat),
              diagnosticErrors(cv.out.qda$stat), diagnosticErrors(cv.out.knn$stat),
              diagnosticErrors(cv.out.logit$stat), diagnosticErrors(cv.out.nn$stat),
              diagnosticErrors(cv.out.svml$stat), diagnosticErrors(cv.out.svmg$stat),
              diagnosticErrors(cv.out.kmeans$stat), diagnosticErrors(cv.out.sc$stat))
rownames(res_tab) <- c("AdaBoost", "LDA", "QDA", "knn", "logit",
                       "Neural Network", "linear SVM", "Gaussian SVM", "k-Means",
                       "Spectral Clustering")
kable(res_tab, caption = "Compare Result")

Leaving  knn,  kmeans  and  specc  aside,  the  other  methods  achieve  pretty  good 
results.  In  the  PD  case  study,  the  reason  for  suboptimal  results  in  some  clustering 
methods  may  be  rooted  in  lack  of  training  (e.g.,  specc  and  kmeans)  or  the  curse  of 
(high)  dimensionality,  which  we  saw  in  Chap.  7.  As  the  data  are  rather  sparse, 
predicting  from  the  nearest  neighbors  may  not  be  too  reliable. 


21.9 Assignment: 21. Prediction and Internal Statistical
Cross-Validation

Demonstrate  cross-validation  on  these  two  case-studies  independently: 

•  Example  1 :  ALS  (Amyotrophic  Lateral  Sclerosis) 

•  Example  2:  Quality  of  Life  in  Chronic  Illness 

(Case06_QoL_Symptom_ChronicIllness.csv) 

Go  through  the  following  protocol: 

•  Review  the  case-study. 

• Choose appropriate dichotomous, polytomous or continuous outcome variables, e.g., use ALSFRS_slope for ALS, CHRONICDISEASESCORE for case 06 and cast them as dichotomous outcomes.

• Apply appropriate data preprocessing.

• Perform regression modeling (e.g., OLS, glmnet, Forward or Backward model selection, etc.) for continuous outcomes.

•  Perform  classification  and  prediction  using  various  methods  (e.g.,  LDA,  QDA, 
AdaBoost,  SVM,  Neural  Network,  KNN)  for  discrete  outcomes. 

•  Apply  cross-validation  on  these  regression  and  classification  methods, 
respectively. 

•  Report  standard  error  for  regression  approaches. 

•  Report  appropriate  quality  metrics  that  can  be  used  to  rank  the  forecasting 
approaches  based  on  the  predictive  power  of  their  results. 

• Compare the results of model-driven and data-driven (e.g., KNN) approaches.

• Compare the sensitivity and specificity.

• Use unsupervised classification methods (k-Means) and spectral clustering.

•  Evaluate  and  justify  a  k-Means  model  and  report  the  agreement  of  the  derived 
clusters  and  the  real  labels. 

• Report the classification error of k-means and also compare with the result of k-means++.

Chapter 22

Function Optimization

Most data-driven scientific inference, qualitative, quantitative, and visual analytics involve formulating, understanding the behavior of, and optimizing objective (cost) functions. Presenting the mathematical foundations of representation and interrogation of diverse spectra of objective functions provides mechanisms for obtaining effective solutions to complex big data problems. (Multivariate) function optimization (minimization or maximization) is the process of searching for variables $x_1, x_2, x_3, \ldots, x_n$ that either minimize or maximize the multivariate cost function $f(x_1, x_2, x_3, \ldots, x_n)$. In this chapter, we will specifically discuss (1) constrained and unconstrained optimization; (2) Lagrange multipliers; (3) linear, quadratic and (general) non-linear programming; and (4) data denoising.


22.1  Free  (Unconstrained)  Optimization 

We will start with function optimization without restrictions for the domain of the cost function, $\Omega \ni \{x_i\}$. The extreme value theorem suggests that a solution to the free optimization processes, $\min_{x_1, x_2, \ldots, x_n} f(x_1, x_2, \ldots, x_n)$ or $\max_{x_1, x_2, \ldots, x_n} f(x_1, x_2, \ldots, x_n)$, may be obtained by a gradient vector descent method. This means that we can minimize/maximize the objective function by finding solutions to $\nabla f = \{\partial f/\partial x_1, \ldots, \partial f/\partial x_n\} = \{0, 0, \ldots, 0\}$. Solutions to this equation, $x_1, \ldots, x_n$, will present candidate (local) minima and maxima.

In general, identifying critical points using the gradient or tangent plane, where the partial derivatives vanish, may not be sufficient to determine the extrema (minima or maxima) of multivariate objective functions. Some critical points may represent inflection points, or local extrema that are far from the global optimum of the objective function. The eigenvalues of the Hessian matrix, which includes the second order partial derivatives, at the critical points provide clues to pinpoint extrema. For instance, an invertible Hessian matrix that (i) is positive definite (i.e., all eigenvalues are positive) indicates a local minimum at the critical point, (ii) is negative definite (all eigenvalues are negative) indicates a local maximum, and (iii) has both positive and negative eigenvalues indicates a saddle point of the objective function at the critical point where the gradient vanishes.
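The second-order test can be sketched numerically with stats::optimHess, which approximates the Hessian at a given point. The helper name classify_critical, the tolerance, and the three test functions are illustrative assumptions.

```r
# Sketch: classify a critical point by the signs of the Hessian eigenvalues
classify_critical <- function(f, x0, tol = 1e-6) {
  H  <- optimHess(x0, f)                        # numerical Hessian at x0
  ev <- eigen(H, symmetric = TRUE)$values
  if (all(ev > tol)) "local minimum"
  else if (all(ev < -tol)) "local maximum"
  else if (any(ev > tol) && any(ev < -tol)) "saddle point"
  else "inconclusive (near-singular Hessian)"
}

classify_critical(function(x) x[1]^2 + x[2]^2, c(0, 0))       # bowl
classify_critical(function(x) -(x[1]^2 + x[2]^2), c(0, 0))    # inverted bowl
classify_critical(function(x) x[1]^2 - x[2]^2, c(0, 0))       # saddle
```

The function assumes that x0 is already a critical point; it does not check that the gradient vanishes there.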

There are two complementary strategies to avoid being trapped in local extrema. First, we can run many iterations with different initial vectors. At each iteration, the objective function may achieve a (local) maximum/minimum/saddle point. Finally, we select the overall minimal (or maximal) value from all iterations. Another, adaptive, strategy involves either adjusting the step sizes or accepting candidate solutions probabilistically; simulated annealing is one example of such adaptive optimization.
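The first (multi-start) strategy can be sketched as follows. The objective f_multi and the grid of starting values are assumptions chosen only to produce several local minima.

```r
# Sketch: multi-start local optimization, keeping the best local solution
f_multi <- function(x) sin(3 * x) + 0.1 * x^2       # oscillatory, several local minima

starts <- seq(-5, 5, by = 0.5)                      # different initial values
fits   <- lapply(starts, function(s) optim(s, f_multi, method = "BFGS"))
values <- sapply(fits, function(fit) fit$value)

best <- fits[[which.min(values)]]                   # overall minimum across all runs
c(par = best$par, value = best$value)
table(round(sapply(fits, function(fit) fit$par), 2))  # distinct local minima that were found
```

Different starting values converge to different local minima; only the comparison across runs identifies the best one, and even then there is no guarantee of global optimality outside the explored region.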


22.1.1  Example  1:  Minimizing  a  Univariate  Function 
(Inverse-CDF) 

The cumulative distribution function (CDF) of a real-valued random process $X$, also known as the distribution function of $X$, represents the probability that the random variable $X$ does not exceed a certain level. Mathematically speaking, the CDF of $X$ is $F_X(x) = P(X \le x)$. Recall the Chap. 2 discussions of Uniform, Normal, Cauchy, Binomial, Poisson and other discrete and continuous distributions. Also explore the dynamic representations of density and distribution functions included in the Probability Distributome Calculators (http://distributome.org).

For each $p \in [0, 1]$, the inverse distribution function, also called quantile function (e.g., qnorm), yields the critical value ($x$) for which the probability that the random variable does not exceed $x$ equals the given probability ($p$). When the CDF $F_X$ is continuous and strictly increasing, the value of the inverse CDF at $p$, $F^{-1}(p) = x$, is the unique real number $x$ such that $F(x) = p$.
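As a quick illustration of this inverse relationship, the built-in quantile and distribution functions of the standard Normal distribution undo each other (the variable names p_vals and x_vals are arbitrary):

```r
# qnorm is the inverse of pnorm for the (continuous, strictly increasing) Normal CDF
p_vals <- c(0.025, 0.5, 0.975)
x_vals <- qnorm(p_vals, mean = 0, sd = 1)   # quantile function F^{-1}(p)
x_vals
pnorm(x_vals)                               # applying the CDF recovers p
all.equal(pnorm(x_vals), p_vals)
```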

Below,  we  will  plot  the  probability  density  function  (PDF)  and  the  CDF  for 
Normal  distribution  in  R  (Fig.  22.1). 


par(mfrow=c(1,2), mar=c(3,4,4,2))
z<-seq(-4, 4, 0.1)  # points from -4 to 4 in 0.1 steps
q<-seq(0.001, 0.999, 0.001)  # probability quantile values from 0.1% to 99.9% in 0.1% steps

dStandardNormal <- data.frame(Z=z, Density=dnorm(z, mean=0, sd=1),
                              Distribution=pnorm(z, mean=0, sd=1))

plot(z, dStandardNormal$Density, col="darkblue", xlab="z",
     ylab="Density", type="l", lwd=2, cex=2, main="Standard Normal PDF",
     cex.axis=0.8)

# could also do
# xseq<-seq(-4, 4, 0.01); density<-dnorm(xseq, 0, 1); plot(density, main="Density")

# Compute the CDF
xseq<-seq(-4, 4, 0.01); cumulative<-pnorm(xseq, 0, 1)

# plot(cumulative, main="CDF")
plot(xseq, cumulative, col="darkred", xlab="", ylab="Cumulative Probability",
     type="l", lwd=2, cex=2, main="CDF of (Simulated) Standard Normal", cex.axis=.8)

Fig. 22.1 Plots of the density and cumulative distribution functions of the simulated data (panels: Standard Normal PDF; CDF of (Simulated) Standard Normal)


Suppose we are interested in computing, or estimating, the inverse-CDF from first principles. Specifically, to invert the CDF, we need to be able to solve the following equation (representing our objective function):

$$CDF(x) - p = 0.$$

The R function uniroot searches a given interval for a root of a univariate function, whereas stats::nlm performs non-linear minimization of a function $f$ using a Newton-type algorithm.

set.seed(1234)
x <- rnorm(1000, 100, 20)
pdf_x <- density(x)

# Interpolate the density, the values returned when input x values are outside [min(x): max(x)] should be trivial
f_x <- approxfun(pdf_x$x, pdf_x$y, yleft=0, yright=0)

# Manual computation of the cdf by numeric integration
cdf_x <- function(x){
  v <- integrate(f_x, -Inf, x)$value
  if (v<0) v <- 0
  else if(v>1) v <- 1
  return(v)
}

# Finding the roots of the inverse-CDF function by hand (CDF(x)-p=0)
invcdf <- function(p){
  uniroot(function(x){cdf_x(x) - p}, range(x))$root
  # alternatively, can minimize the squared residual, e.g.,
  # nlm(function(x){(cdf_x(x) - p)^2}, mean(x))$estimate
  # minimum - the value of the estimated minimum of f.
  # estimate - the point at which the minimum value of f is obtained.
}

invcdf(0.5)

## [1] 99.16995

# We can validate that the inverse-CDF is correctly computed: F(F^{-1}(p)) == p
cdf_x(invcdf(0.8))

## [1] 0.8

The  ability  to  compute  exactly,  or  at  least  estimate,  the  inverse-CDF  function  is 
important  for  many  reasons.  For  instance,  generating  random  observations  from  a 
specified  probability  distribution  (e.g.,  normal,  exponential,  or  gamma  distribution) 
is  an  important  task  in  many  scientific  studies.  One  approach  for  such  random 
number  generation  from  a  specified  distribution  evaluates  the  inverse  CDF  at 
random  uniform  u  ~  U( 0, 1)  values.  Recall  that  in  Chap.  16  we  showed  an  example 
of generating random uniform samples using atmospheric noise. The key step is the ability to quickly, efficiently and reliably estimate the inverse CDF function, of which we just showed one example.

Let’s see why inverting the CDF using random uniform data works. Consider the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling. If the CDF has a closed form analytical expression and is invertible, then we generate a random sample from that distribution by evaluating the inverse CDF at $u$, where $u \sim U(0, 1)$. This is possible since a continuous and strictly increasing CDF, $F$, is a one-to-one mapping of the domain of the CDF (range of $X$) into the interval $[0, 1]$. Therefore, if $U$ is a uniform random variable on $[0, 1]$, then $X = F^{-1}(U)$ has the distribution $F$. Indeed, suppose $U \sim Uniform[0, 1]$; then $P(F^{-1}(U) \le x) = P(U \le F(x))$, by applying $F$ to both sides of the inequality, since $F$ is monotonic. Thus, $P(F^{-1}(U) \le x) = F(x)$, since $P(U \le u) = u$ for uniform random variables.
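A small sketch of inverse-transform sampling uses the Exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F^{-1}(u) = -\log(1 - u)/\lambda$. The rate value and the sample size are arbitrary assumptions.

```r
# Sketch: inverse-transform sampling from Exponential(rate)
set.seed(1234)
rate  <- 2
u     <- runif(10000)                  # U ~ Uniform(0, 1)
x_exp <- -log(1 - u) / rate            # X = F^{-1}(U)

c(sample_mean = mean(x_exp), theoretical_mean = 1 / rate)

# the built-in quantile function gives the same transformation
all.equal(x_exp, qexp(u, rate))
```

The sample mean should be close to the theoretical mean 1/rate, and a histogram of x_exp can be compared against dexp(x, rate) for a visual check.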


22.1.2  Example  2:  Minimizing  a  Bivariate  Function 

Let’s look at the function $f(x_1, x_2) = (x_1 - 3)^2 + (x_2 + 4)^2$. We define the function in R and utilize the optim() function to obtain the extrema points in the support of the objective function and/or the extrema values at these critical points.

require("stats")
f <- function(x) { (x[1] - 3)^2 + (x[2] + 4)^2 }
initial_x <- c(0, -1)
x_optimal <- optim(initial_x, f, method="CG")  # performs minimization
x_min <- x_optimal$par

# x_min contains the domain values where the (local) minimum is attained
x_min  # critical point/vector

## [1]  3 -4

x_optimal$value  # extrema value of the objective function

## [1] 8.450445e-15

optim allows the use of six candidate optimization strategies:

•  Nelder-Mead:  robust  but  relatively  slow,  works  reasonably  well  for 
non-differentiable  functions. 

•  BFGS:  quasi-Newton  method  (also  known  as  a  variable  metric  algorithm),  uses 
function  values  and  gradients  to  build  up  a  picture  of  the  surface  to  be  optimized. 

• CG: conjugate gradients method, fragile, but successful in larger optimization problems because it does not need to store large matrices.

•  L-BFGS-B:  allows  box  constraints. 

•  SANN:  a  variant  of  simulated  annealing,  belonging  to  the  class  of  stochastic 
global  optimization  methods. 

• Brent: for one-dimensional problems only, useful in cases where optim() is used inside other functions where only method can be specified.
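To see how some of these strategies behave on the same problem, the following sketch runs four of the methods on the bivariate quadratic from this example and tabulates the solutions and the number of function evaluations. The object names f2 and res_methods are arbitrary, and SANN and Brent are left out (SANN is stochastic and Brent is restricted to one-dimensional problems).

```r
# Sketch: comparing several optim() methods on f(x1, x2) = (x1 - 3)^2 + (x2 + 4)^2
f2 <- function(x) (x[1] - 3)^2 + (x[2] + 4)^2
methods <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B")

res_methods <- sapply(methods, function(m) {
  fit <- optim(c(0, -1), f2, method = m)
  c(x1 = fit$par[1], x2 = fit$par[2], value = fit$value,
    fn_calls = fit$counts[["function"]])
})

round(t(res_methods), 4)    # one row per method
```

All four methods should recover the minimizer near (3, -4); they differ mainly in how many function (and gradient) evaluations they need.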


22.1.3  Example  3:  Using  Simulated  Annealing  to  Find 
the  Maximum  of  an  Oscillatory  Function 

Consider the function $f(x) = 10 \sin(0.3x) \times \sin(1.3x^2) - 0.00002x^4 + 0.3x + 35$. Maximizing $f()$ is equivalent to minimizing $-f()$. Let’s plot this oscillatory function, then find and report its critical points and extremum values.

The  function  optim  returns  two  important  results: 

•  par:  the  best  set  of  domain  parameters  found  to  optimize  the  function 

•  value:  the  extreme  values  of  the  function  corresponding  to  par  (Fig.  22.2). 


Fig. 22.2 Example of minimizing an oscillatory function, $f(x) = 10\sin(0.3x)\times\sin(1.3x^2) - 0.00002x^4 + 0.3x + 35$, using optim (plot title: "optim() minimizing an oscillatory function")

funct_osc <- function(x) { -(10*sin(0.3*x)*sin(1.3*x^2) - 0.00002*x^4 + 0.3*x + 35) }

plot(funct_osc, -50, 50, n = 1000, main = "optim() minimizing an oscillatory function")

abline(v=17, lty=3, lwd=4, col="red")

res <- optim(16, funct_osc, method = "SANN", control = list(maxit = 20000, temp = 20, parscale = 20))

res$par

## [1] 15.66197

res$value

## [1] -48.49313

22.2  Constrained  Optimization 
22.2.1  Equality  Constraints 

When there are support restrictions, dependencies, or other associations between the
domain variables $x_1, x_2, \ldots, x_n$, constrained optimization needs to be applied.

For  example,  we  can  have  k  equations  specifying  these  restrictions,  which  may 
specify  certain  model  characteristics: 

$$\begin{cases} g_1(x_1, x_2, \ldots, x_n) = 0 \\ \qquad \vdots \\ g_k(x_1, x_2, \ldots, x_n) = 0 \end{cases}$$

Note that the right hand sides of these equations may always be assumed to be
trivial (0), otherwise we can just move the non-trivial parts within the constraint
functions $g_i$. Linear Programming, Quadratic Programming, and Lagrange multipliers
may be used to solve such equality-constrained optimization problems.


22.2.2  Lagrange  Multipliers 

We can merge the equality constraints within the objective function ($f \to f^*$).
Lagrange multipliers represent a typical solution strategy that turns the constrained
optimization problem ($\min_x f(x)$ subject to $g_i(x_1, x_2, \ldots, x_n) = 0$, $1 \le i \le k$), into an
unconstrained optimization problem:

$$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^{k} \lambda_i g_i(x_1, x_2, \ldots, x_n).$$

Then, we can apply traditional unconstrained optimization schemas, e.g., extreme
value theorem, to minimize the unconstrained problem:

$$f^*(x_1, x_2, \ldots, x_n; \lambda_1, \lambda_2, \ldots, \lambda_k) = f(x_1, x_2, \ldots, x_n) + \lambda_1 g_1(x_1, x_2, \ldots, x_n) + \cdots + \lambda_k g_k(x_1, x_2, \ldots, x_n).$$


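This represents an unconstrained optimization problem using Lagrange
multipliers.

The solution of the constrained problem is also a solution to:

$$\nabla f^* = \left[ \frac{\partial f^*}{\partial x_1}, \frac{\partial f^*}{\partial x_2}, \ldots, \frac{\partial f^*}{\partial x_n}, \frac{\partial f^*}{\partial \lambda_1}, \frac{\partial f^*}{\partial \lambda_2}, \ldots, \frac{\partial f^*}{\partial \lambda_k} \right] = [0, 0, \ldots, 0].$$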
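To make the stationarity condition concrete, here is a small worked sketch on a hypothetical problem (not one of the chapter's examples): minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$. Setting all partial derivatives of $f^* = x^2 + y^2 + \lambda(x + y - 1)$ to zero gives a linear system in $(x, y, \lambda)$ that base R can solve directly.

```r
# Stationarity conditions of f*(x, y, lambda) = x^2 + y^2 + lambda * (x + y - 1):
#   df*/dx      = 2x + lambda = 0
#   df*/dy      = 2y + lambda = 0
#   df*/dlambda = x + y - 1   = 0
A <- rbind(c(2, 0, 1),
           c(0, 2, 1),
           c(1, 1, 0))
b <- c(0, 0, 1)

sol <- solve(A, b)                  # solve the linear system A %*% (x, y, lambda) = b
names(sol) <- c("x", "y", "lambda")
sol
```

For objectives or constraints that are not quadratic/linear, the same conditions form a non-linear system, which is why an iterative root finder such as Newton's method is needed in general.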


22.2.3  Inequality  Constrained  Optimization 

There  are  no  general  solutions  for  arbitrary  inequality  constraints;  however,  partial 
solutions  do  exist  when  some  restrictions  on  the  form  of  constraints  are  present. 

When  both  the  constraints  and  the  objective  function  are  linear  functions 
of  the  domain  variables,  then  the  problem  can  be  solved  by  Linear  Programming. 


Linear  Programming  (LP) 

LP works when the objective function is a linear function. The constraint functions
are also linear combinations of the same variables.

Consider the following elementary (minimization) example:

$$\min_{x_1, x_2, x_3} (-3x_1 - 4x_2 - 3x_3)$$

subject to:

$$\begin{cases} 6x_1 + 2x_2 + 4x_3 \le 150 \\ x_1 + x_2 + 6x_3 \ge 0 \\ 4x_1 + 5x_2 + 4x_3 = 40 \end{cases}$$

The exact solution is $x_1 = 0$, $x_2 = 8$, $x_3 = 0$, and can be computed using the
package lpSolveAPI to set up the constraint problem and the generic solve()
method to find its solutions.




# install.packages("lpSolveAPI")
library(lpSolveAPI)

lps.model <- make.lp(0, 3)  # define 3 variables

# add the constraints as a matrix of the linear coefficients, relations and RHS
add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=", 0)
add.constraint(lps.model, c(4, 5, 4), "=", 40)

# set objective function (default: find minimum)
set.objfn(lps.model, c(-3, -4, -3))

# you can save the model to a file
# write.lp(lps.model, 'c:/Users/LPmodel.lp', type='lp')

# these commands define the constraint linear model
# /* Objective function */
# min: -3 x1 -4 x2 -3 x3;
#
# /* Constraints */
# +6 x1 +2 x2 +4 x3 <= 150;
# +  x1 +  x2 +6 x3 >= 0;
# +4 x1 +5 x2 +4 x3 = 40;
#
# writing it in the text file named 'LPmodel.lp'
solve(lps.model)

## [1] 0

# Retrieve the values of the variables from a solved linear program model
get.variables(lps.model)  # check against the exact solution x_1 = 0, x_2 = 8, x_3 = 0

## [1] 0 8 0

get.objective(lps.model)  # get optimal (min) value
## [1] -32

In  lower  dimensional  problems,  we  can  also  plot  the  constraints  to  graphically 
demonstrate  the  corresponding  support  restriction.  For  instance,  here  is  an  example 
of  a  simpler  2D  constraint  and  its  Venn  diagrammatic  representation  (Fig.  22.3). 


library(ggplot2)

ggplot(data.frame(x = c(-100, 0)), aes(x = x)) +
  stat_function(fun=function(x) {(150-2*x)/6}, aes(color="Function 1")) +
  stat_function(fun=function(x) { -x }, aes(color = "Function 2")) +
  theme_bw() +
  scale_color_discrete(name = "Function") +
  geom_polygon(
    data = data.frame(x = c(-100, -100, 0, 0, Inf), y = c(0, 350/6, 150/6, 0, 0)),
    aes(x = x, y = y, fill = "Constraint 1"),
    inherit.aes = FALSE, alpha = 0.5) +
  geom_polygon(
    data = data.frame(x = c(-100, -100, 0, Inf), y = c(0, 100, 0, 0)),
    aes(x = x, y = y, fill = "Constraint 2"),
    inherit.aes = FALSE, alpha = 0.3) +
  scale_fill_discrete(name = "Constraint Set") +
  scale_y_continuous(limits = c(0, 100))

Fig. 22.3 A 2D graphical depiction of the function optimization support restriction constraints (legends: Function 1, Function 2; Constraint Set: Constraint 1, Constraint 2)

Here is another example of maximization of a trivariate cost function,
$f(x_1, x_2, x_3) = 3x_1 + 4x_2 - x_3$, subject to:

$$\begin{cases} -x_1 + 2x_2 + 3x_3 \le 16 \\ 3x_1 - x_2 - 6x_3 \ge 0 \\ x_1 - x_2 \le 2 \end{cases}$$

lps.model2 <- make.lp(0, 3)

add.constraint(lps.model2, c(-1, 2, 3), "<=", 16)
add.constraint(lps.model2, c(3, -1, -6), ">=", 0)
add.constraint(lps.model2, c(1, -1, 0), "<=", 2)

set.objfn(lps.model2, c(3, 4, -1), indices = c(1, 2, 3))

lp.control(lps.model2, sense='max')  # changes to max: 3 x1 + 4 x2 - x3


## $anti.degen
## [1] "fixedvars" "stalling"
##
## $basis.crash
## [1] "none"
##
## $bb.depthlimit
## [1] -50
##
## $bb.floorfirst
## [1] "automatic"
##
## $bb.rule
## [1] "pseudononint" "greedy"       "dynamic"      "rcostfixing"
##
## $break.at.first
## [1] FALSE
##
## $break.at.value
## [1] 1e+30
##
## $epsilon
##       epsb       epsd      epsel     epsint epsperturb   epspivot
##      1e-10      1e-09      1e-12      1e-07      1e-05      2e-07
##
## $improve
## [1] "dualfeas" "thetagap"
##
## $infinite
## [1] 1e+30
##
## $maxpivot
## [1] 250
##
## $mip.gap
## absolute relative
##    1e-11    1e-11
##
## $negrange
## [1] -1e+06
##
## $obj.in.basis
## [1] TRUE
##
## $pivoting
## [1] "devex"    "adaptive"
##
## $presolve
## [1] "none"
##
## $scalelimit
## [1] 5
##
## $scaling
## [1] "geometric"   "equilibrate" "integers"
##
## $sense
## [1] "maximize"
##
## $simplextype
## [1] "dual"   "primal"
##
## $timeout
## [1] 0
##
## $verbose
## [1] "neutral"

solve(lps.model2)  # 0 indicates that the solver found an optimal solution

## [1] 0

get.variables(lps.model2)  # get point of maximum
## [1] 20 18  0

get.objective(lps.model2)  # get optimal (max) value
## [1] 132

In 3D, we can utilize the rgl::surface3d() method to display the constraints.
This output is suppressed, as it can only be interpreted via the pop-out 3D
rendering window.

library("rgl")
n <- 100

x <- y <- seq(-500, 500, length = n)
region <- expand.grid(x = x, y = y)

# constraint surfaces; only fragments of the z1 and z3 definitions survive in
# this listing, so their forms below are assumed from the first LP example
z1 <- matrix(((150 - 2*region$x - 4*region$y)/6), n, n)
z2 <- matrix(-region$x + 6*region$y, n, n)
z3 <- matrix(40 - 5*region$x - 4*region$y, n, n)

surface3d(x, y, z1, back = 'line', front = 'line', col = 'red', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z2, back = 'line', front = 'line', col = 'orange', lwd = 1.5, alpha = 0.4)
surface3d(x, y, z3, back = 'line', front = 'line', col = 'blue', lwd = 1.5, alpha = 0.4)
axes3d()

It  is  possible  to  restrict  the  domain  type  to  contain  only  solutions  that  are: 

• integers, which makes it an Integer Linear Programming (ILP) problem,

• binary/boolean values (BLP), or

• mixed types, Mixed Integer Linear Programming (MILP).

Some  examples  are  included  below. 




Mixed  Integer  Linear  Programming  (MILP) 

Let’s demonstrate MILP with an example where the type of $x_1$ is unrestricted, $x_2$ is
dichotomous (binary), and $x_3$ is restricted to be an integer.

lps.model <- make.lp(0, 3)

add.constraint(lps.model, c(6, 2, 4), "<=", 150)
add.constraint(lps.model, c(1, 1, 6), ">=", 0)
add.constraint(lps.model, c(4, 5, 4), "=", 40)

set.objfn(lps.model, c(-3, -4, -3))

set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "integer")

get.type(lps.model)  # This is Mixed Integer Linear Programming (MILP)

## [1] "real"    "integer" "integer"

set.bounds(lps.model, lower=-5, upper=5, columns=c(1))

# give names to columns and restrictions
dimnames(lps.model) <- list(c("R1", "R2", "R3"), c("x1", "x2", "x3"))

print(lps.model)

## Model name:
##             x1    x2    x3
## Minimize    -3    -4    -3
## R1           6     2     4  <=  150
## R2           1     1     6  >=    0
## R3           4     5     4   =   40
## Kind       Std   Std   Std
## Type      Real   Int   Int
## Upper        5     1   Inf
## Lower       -5     0     0

solve(lps.model)

## [1] 0

get.objective(lps.model)

## [1] -30.25

get.variables(lps.model)

## [1] 4.75 1.00 4.00

get.constraints(lps.model)

## [1] 46.50 29.75 40.00

The next example limits all three variables to be dichotomous (binary).
lps.model <- make.lp(0, 3)

add.constraint(lps.model, c(1, 2, 4), "<=", 5)
add.constraint(lps.model, c(1, 1, 6), ">=", 2)
add.constraint(lps.model, c(1, 1, 1), "=", 2)

set.objfn(lps.model, c(2, 1, 2))

set.type(lps.model, 1, "binary")
set.type(lps.model, 2, "binary")
set.type(lps.model, 3, "binary")

print(lps.model)

## Model name:
##             C1    C2    C3
## Minimize     2     1     2
## R1           1     2     4  <=  5
## R2           1     1     6  >=  2
## R3           1     1     1   =  2
## Kind       Std   Std   Std
## Type       Int   Int   Int
## Upper        1     1     1
## Lower        0     0     0

solve(lps.model)

## [1] 0

get.variables(lps.model)

## [1] 1 1 0


22.2.4  Quadratic  Programming  (QP) 

QP can be used for second order (quadratic) objective functions, but the constraint
functions are still linear combinations of the domain variables.

A matrix formulation of the problem can be expressed as minimizing an objective
function:

$$f(X) = \frac{1}{2} X^T D X - d^T X,$$

where $X$ is a vector $[x_1, x_2, \ldots, x_n]^T$, $D$ is the matrix of weights of each association
pair, $x_i, x_j$, and $d$ are the weights for each individual feature, $x_i$. The $\frac{1}{2}$ coefficient
ensures that the weights matrix $D$ is symmetric and each $x_i, x_j$ pair is not
double-counted. This cost function is subject to the constraints:

$$A^T X \; [\, = \, | \, \ge \,] \; b,$$

where the first $k$ constraints may represent equalities ($=$) and the remaining ones are
inequalities ($\ge$), and $b$ is the constraints right hand side (RHS) constant vector.

Here is an example of a QP objective function and its R optimization:

$$f(x_1, x_2, x_3) = x_1^2 - x_1 x_2 + x_2^2 - x_2 x_3 + x_3^2 + 5x_2 - 3x_3.$$

In the matrix form above, this corresponds to the linear weights $d = (0, -5, 3)^T$,
since the linear term enters the objective as $-d^T X$.

Subject to the following constraints:

$$\begin{cases} -4x_1 - 3x_2 = -8 \\ 2x_1 + x_2 = 2 \\ -2x_2 + x_3 \ge 0 \end{cases}$$


library(quadprog)

Dmat <- matrix(c( 2, -1,  0,
                 -1,  2, -1,
                  0, -1,  2), 3, 3)
dvec <- c(0, -5, 3)
# each column of Amat holds the coefficients of one constraint
Amat <- matrix(c(-4, -3,  0,
                  2,  1,  0,
                  0, -2,  1), 3, 3)
bvec <- c(-8, 2, 0)
n.eqs <- 2  # the first two constraints are equalities

sol <- solve.QP(Dmat, dvec, Amat, bvec=bvec, meq=2)
sol$solution  # get the (x1, x2, x3) point of minimum

## [1] -1  4  8

sol$value  # get the actual cost function minimum
## [1] 49

The minimum value, 49, of the QP solution is attained at $x_1 = -1$, $x_2 = 4$, $x_3 = 8$.

When $D$ is a positive definite matrix, i.e., $X^T D X > 0$, for all non-zero $X$, the QP
problem may be solved in polynomial time. Otherwise, the QP problem is NP-hard.
In general, even if $D$ has only one negative eigenvalue, the QP problem is still
NP-hard.

The QP function solve.QP() expects a positive definite matrix $D$.
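Before calling a QP solver, it can help to verify positive definiteness numerically. The following base R sketch checks the eigenvalues of the weight matrix from the example and re-evaluates the objective at the reported solution; the helper name qp_objective is hypothetical, and the objective is written in the $\frac{1}{2} X^T D X - d^T X$ form stated above.

```r
Dmat <- matrix(c( 2, -1,  0,
                 -1,  2, -1,
                  0, -1,  2), 3, 3)
dvec <- c(0, -5, 3)

# a symmetric matrix is positive definite when all of its eigenvalues are > 0
eig <- eigen(Dmat, symmetric = TRUE, only.values = TRUE)$values
is_pd <- all(eig > 0)
is_pd

# objective in the matrix form: 1/2 * x' D x - d' x
qp_objective <- function(x) {
  as.numeric(0.5 * t(x) %*% Dmat %*% x - sum(dvec * x))
}

# evaluate at the reported point of minimum (-1, 4, 8)
qp_objective(c(-1, 4, 8))
```

The value computed at the reported solution should agree with the sol$value output above, which is a quick consistency check on how the linear weights enter the objective.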


22.3  General  Non-linear  Optimization 

The package Rsolnp provides a special function solnp(), which solves the
general non-linear programming problem:

$$\min_x f(x)$$

subject to:

$$\begin{cases} g(x) = 0 \\ l_h \le h(x) \le u_h \\ l_x \le x \le u_x, \end{cases}$$

where $f(x)$, $g(x)$, $h(x)$ are all smooth functions.


22.3.1 Dual Problem Optimization

Duality in math really just means having two complementary ways to think about
an optimization problem. The primal problem represents an optimization challenge
in terms of the original decision variable $x$. The dual problem, also called
Lagrange dual, searches for a lower bound of a minimization problem or an
upper bound for a maximization problem. In general, the primal problem may be
difficult to analyze, or solve directly, because it may include non-differentiable
penalty terms, e.g., norms, recall LASSO/Ridge regularization in Chap. 18.
Hence, we turn to the corresponding Lagrange dual problem where the solutions
may be more amenable, especially for convex functions, that satisfy the following
inequality for all $0 \le \lambda \le 1$:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$


Motivation 

Suppose we want to borrow money, $x$, from a bank, or lender, and $f(x)$ represents the
borrowing cost to us. There are natural “design constraints” on money lending. For
instance, there may be a cap in the interest rate, $h(x) \le b$, or we can have many other
constraints on the loan duration. There may be multiple lenders, including
self-funding, that may “charge” us $f(x)$ for lending us $x$. Lenders’ goals are to maximize
profits. Yet, they can’t charge you more than the prime interest rate, plus some
premium based on your credit worthiness. Thus, for a given fixed $\lambda$, a lender may
make us an offer to lend us $x$ aiming to minimize

$$f(x) + \lambda \times h(x).$$

If this cost is not optimized, i.e., minimized, you may be able to get another loan
$y$ at lower cost $f(y) < f(x)$, and the funding agency loses your business. If the
cost/objective function is minimized, the lender may maximize their profit by varying $\lambda$
and still get us to sign on the loan.

The  customer’s  strategy  represents  a  game  theoretic  interpretation  of  the  primal 
problem,  whereas  the  dual  problem  corresponds  to  the  strategy  of  the  lender. 

In solving complex optimization problems, duality is equivalent to the existence of a
saddle point of the Lagrangian. For convex problems, the double-dual is equivalent
to the primal problem. In other words, applying the convex conjugate (Fenchel
transform) twice returns the convexification of the original objective function, which
in most situations is the same as the original function.

The dual of a vector space is defined as the space of all continuous linear
functionals on that space. Let $X = R^n$, $Y = R^m$, $f: X \to R$, and $h: X \to Y$. Consider
the following optimization problem:

$$\min_x f(x)$$

subject to

$$h(x) \le 0.$$

Then, this primal problem has a corresponding dual problem, which seeks the best
lower bound on the primal minimum:

$$\max_\lambda \inf_x \left( f(x) + \langle \lambda, h(x) \rangle \right)$$

subject to

$$\lambda_i \ge 0, \ \forall \, 1 \le i \le m.$$

The parameter $\lambda \in R^m$ is an element of the dual space of $Y$, i.e., $Y^*$, since the inner
product $\langle \lambda, h(x) \rangle$ is a continuous linear functional on $Y$. Here $Y$ is finite dimensional
and by the Riesz representation theorem $Y^*$ is isomorphic to $Y$. Note that in general,
for infinite dimensional spaces, $Y$ and $Y^*$ are not guaranteed to be isomorphic.
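The relation between the primal and dual values can be checked numerically on a small convex problem. The sketch below uses a hypothetical one-dimensional example (not from the text): minimize $f(x) = x^2$ subject to $h(x) = 1 - x \le 0$. The dual function $q(\lambda) = \inf_x \{ f(x) + \lambda h(x) \}$ is computed with base R optimize(), and the search intervals are assumptions chosen to contain the optima.

```r
f <- function(x) x^2
h <- function(x) 1 - x      # constraint h(x) <= 0, i.e., x >= 1

# dual function q(lambda) = inf_x { f(x) + lambda * h(x) }, computed numerically
q <- function(lambda) {
  optimize(function(x) f(x) + lambda * h(x), interval = c(-10, 10))$objective
}

# primal optimum over the feasible set x >= 1 (assumed search interval)
primal <- optimize(f, interval = c(1, 10))

# dual optimum: maximize the lower bound q over lambda >= 0 (assumed search interval)
dual <- optimize(q, interval = c(0, 10), maximum = TRUE)

c(primal_value = primal$objective,
  dual_value   = dual$objective,
  lambda_star  = dual$maximum)
```

For this convex problem, the two optimal values should coincide up to the numerical tolerance of optimize(); for non-convex problems the dual value may be strictly smaller than the primal one.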


Example  1:  Linear  Example 

Minimize $f(x, y) = 5x - 3y$, constrained by $x^2 + y^2 = 136$, which has a minimum
value of $-68$ attained at $(-10, 6)$. We will use the Rsolnp::solnp() method in
this example.

# install.packages("Rsolnp")
library(Rsolnp)

fn1 <- function(x) {  # f(x, y) = 5x-3y
  5*x[1] - 3*x[2]
}

# constraint z1: x^2+y^2=136
eqn1 <- function(x) {
  z1=x[1]^2 + x[2]^2
  return(c(z1))
}

constraints = c(136)

x0 <- c(1, 1)  # setup initial values

sol1 <- solnp(x0, fun = fn1, eqfun = eqn1, eqB = constraints)

##
## Iter: 1 fn: 37.4378    Pars:   30.55472 38.44528
## Iter: 2 fn: -147.9181  Pars:   -6.57051 38.35517
## Iter: 3 fn: -154.7345  Pars:  -20.10545 18.06907
## Iter: 4 fn: -96.4033   Pars:  -14.71366  7.61165
## Iter: 5 fn: -72.4915   Pars:  -10.49919  6.66517
## Iter: 6 fn: -68.1680   Pars:  -10.04485  5.98124
## Iter: 7 fn: -68.0006   Pars:   -9.99999  6.00022
## Iter: 8 fn: -68.0000   Pars:  -10.00000  6.00000
## Iter: 9 fn: -68.0000   Pars:  -10.00000  6.00000
## solnp--> Completed in 9 iterations

# sol1$values contains all steps of the iteration algorithm and the last value is the min value
sol1$values[10]

## [1] -68

sol1$pars

## [1] -10   6


Example  2:  Quadratic  Example 


Minimize $f(x, y) = 4x^2 + 10y^2 + 5$ subject to the inequality constraint $0 \le x^2 + y^2 \le 4$,
which has a minimum value of 5 attained at the origin $(0, 0)$.


fn2 <- function(x) {  # f(x, y) = 4x^2 + 10y^2 + 5
  4*x[1]^2 + 10*x[2]^2 + 5
}

# constraint z1: x^2+y^2 <= 4
ineq2 <- function(x) {
  z1=x[1]^2 + x[2]^2
  return(c(z1))
}

lh <- c(0)
uh <- c(4)

x0 = c(1, 1)  # setup initial values

sol2 <- solnp(x0, fun = fn2, ineqfun = ineq2, ineqLB = lh, ineqUB=uh)

##
## Iter: 1 fn: 7.8697   Pars:  0.68437 0.31563
## Iter: 2 fn: 5.6456   Pars:  0.39701 0.03895
## Iter: 3 fn: 5.1604   Pars:  0.200217 0.002001
## Iter: 4 fn: 5.0401   Pars:  0.10011821 0.00005323
## Iter: 5 fn: 5.0100   Pars:  0.0500592618 0.0000006781
## Iter: 6 fn: 5.0025   Pars:  0.02502983706 -0.00000004425
## Iter: 7 fn: 5.0006   Pars:  0.01251500215 -0.00000005034
## Iter: 8 fn: 5.0002   Pars:  0.00625757145 -0.00000005045
## Iter: 9 fn: 5.0000   Pars:  0.00312915970 -0.00000004968
## Iter: 10 fn: 5.0000  Pars:  0.00156561388 -0.00000004983
## Iter: 11 fn: 5.0000  Pars:  0.0007831473 -0.0000000508
## Iter: 12 fn: 5.0000  Pars:  0.00039896484 -0.00000005045
## Iter: 13 fn: 5.0000  Pars:  0.00021282342 -0.00000004897
## Iter: 14 fn: 5.0000  Pars:  0.00014285437 -0.00000004926
## Iter: 15 fn: 5.0000  Pars:  0.00011892066 -0.00000004976
## solnp--> Completed in 15 iterations

sol2$values

##  [1] 19.000000  7.869675  5.645626  5.160388  5.040095  5.010024  5.002506
##  [8]  5.000627  5.000157  5.000039  5.000010  5.000002  5.000001  5.000000
## [15]  5.000000  5.000000

sol2$pars

## [1]  1.189207e-04 -4.976052e-08

There  are  a  number  of  parameters  that  control  the  solnp  procedure.  For 
instance,  TOL  defines  the  tolerance  for  optimality  (which  impacts  the  convergence) 
and  trace=0  turns  off  the  printing  of  the  results  at  each  iteration. 

ctrl <- list(TOL=1e-15, trace=0)

sol2 <- solnp(x0, fun = fn2, ineqfun = ineq2, ineqLB = lh, ineqUB=uh, control=ctrl)

sol2$pars

## [1]  1.402813e-08 -5.015532e-08


Example  3:  More  Complex  Non-linear  Optimization 


Let’s try to minimize

$$f(X) = -x_1 x_2 x_3$$

subject to

$$\begin{cases} 4x_1 x_2 + 2x_2 x_3 + 2x_3 x_1 = 100 \\ 1 \le x_i \le 10, \ i = 1, 2, 3. \end{cases}$$


fn3 <- function(x, ...){
  -x[1]*x[2]*x[3]
}

eqn3 <- function(x, ...){
  4*x[1]*x[2] + 2*x[2]*x[3] + 2*x[3]*x[1]
}

constraints3 = c(100)

lx <- rep(1, 3)
ux <- rep(10, 3)

pars <- c(2, 1, 7)  # setup: Try alternative starting-parameter vector (pars)
ctrl <- list(TOL=1e-6, trace=0)

sol3 <- solnp(pars, fun=fn3, eqfun=eqn3, eqB = constraints3, LB=lx, UB=ux, control=ctrl)

sol2$values  # note: these are the iteration values of the previous example (sol2)

##  [1] 19.000000  7.869675  5.645626  5.160388  5.040095  5.010024  5.002506
##  [8]  5.000626  5.000157  5.000039  5.000010  5.000002  5.000001  5.000000
## [15]  5.000000  5.000000  5.000000  5.000000  5.000000  5.000000  5.000000
## [22]  5.000000  5.000000  5.000000  5.000000  5.000000  5.000000  5.000000
## [29]  5.000000

sol3$pars

## [1] 2.886751 2.886751 5.773505

The non-linear optimization is sensitive to the initial parameters (pars), especially
when the objective function is not smooth or if there are many local minima. The
function gosolnp() may be employed to generate initial (guesstimates of the)
parameters.
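The sensitivity to starting values can be illustrated without any additional packages. The sketch below does not use gosolnp(); it swaps in a plain multi-start loop around base R optim(), applied to the oscillatory function from Sect. 22.1.3. The number of random starts and their range are assumptions made only for illustration.

```r
# oscillatory objective from Sect. 22.1.3 (negated, so that optim() minimizes it)
funct_osc <- function(x) { -(10*sin(0.3*x)*sin(1.3*x^2) - 0.00002*x^4 + 0.3*x + 35) }

set.seed(1234)
starts <- runif(50, min = -50, max = 50)   # assumed: 50 random starts in [-50, 50]

# run a local (quasi-Newton) optimizer from each starting value
fits <- lapply(starts, function(s) optim(s, funct_osc, method = "BFGS"))
values <- sapply(fits, function(fit) fit$value)

# different starts typically end in different local minima
range(values)

# keep the best local solution found across all starts
best <- fits[[which.min(values)]]
best$par
best$value
```

A multi-start strategy gives no guarantee of finding the global optimum, but comparing the spread of the local solutions is a simple diagnostic of how strongly a result depends on the initial parameters.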


Example  4:  Another  Linear  Example 

Let’s try another minimization of a linear objective function $f(x, y, z) = 4y - 2z$
subject to

$$\begin{cases} 2x - y - z = 2 \\ x^2 + y^2 = 1. \end{cases}$$


fn4 <- function(x)  # f(x, y, z) = 4y-2z
{
  4*x[2] - 2*x[3]
}

# constraint z1: 2x-y-z = 2
# constraint z2: x^2+y^2 = 1
eqn4 <- function(x){
  z1=2*x[1] - x[2] - x[3]
  z2=x[1]^2 + x[2]^2
  return(c(z1, z2))
}

constraints4 <- c(2, 1)

x0 <- c(1, 1, 1)
ctrl <- list(trace=0)

sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)
sol4$values

## [1]   2.000000  -5.078795 -11.416448  -5.764047  -3.584894  -3.224531
## [7]  -3.211165  -3.211103  -3.211103

sol4$pars

## [1]  0.55470019 -0.83205030 -0.05854932

The  materials  in  the  linear  algebra  and  matrix  computing,  Chap.  5,  and  the 
regularized  parameter  estimation,  Chap.  18,  provide  additional  examples  of  least 
squares  parameter  estimation,  regression,  and  regularization. 


22.4  Manual  Versus  Automated  Lagrange  Multiplier 
Optimization 

Let’s manually implement the Lagrange Multipliers procedure and then compare the
results to some optimization examples obtained by automatic R function calls. The
latter strategies may be more reliable, efficient, flexible, and rigorously validated.
The manual implementation provides a more direct and explicit representation of the
actual optimization strategy.

We will test a simple example of an objective function:

$$f(x, y, z) = 4y - 2z + x^2 + y^2,$$

subject to two constraints:

$$\begin{cases} 2x - y - z = 2 \\ x^2 + y^2 + z = 1. \end{cases}$$

The R package numDeriv may be used to calculate numerical approximations
of partial derivatives.

# define the main Lagrange Multipliers Optimization strategy from scratch
require(numDeriv)

lagrange_multipliers <- function(x, f, g) {  # Objective/cost function, f, and constraints, g
  k <- length(x)
  l <- length(g(x))

  # Compute the derivatives
  grad_f <- function(x) { grad(f, x) }

  # g, representing multiple constraints, is a vector-valued function:
  # its first derivative is a matrix
  grad_g <- function(x) { jacobian(g, x) }

  # The Lagrangian is a scalar-valued function:
  #   L(x, lambda) = f(x) - lambda * g(x)
  # whose first derivative roots give the optimal solutions
  #   h(x, lambda) = c( f'(x) - lambda * g'(x), - g(x) ).
  h <- function(y) {
    c(grad_f(y[1:k]) - t(y[-(1:k)]) %*% grad_g(y[1:k]), -g(y[1:k]))
  }

  # To find the roots of the first derivative, we can use Newton's method:
  # iterate y <- y - h'(y)^{-1} h(y) until certain convergence criterion
  # is met, e.g., (\delta <= 1e-6)
  grad_h <- function(y) { jacobian(h, y) }

  y <- c(x, rep(0, l))
  previous <- y + 1

  while(sum(abs(y-previous)) > 1e-6) {
    previous <- y
    y <- y - solve(grad_h(y), h(y))
  }

  y[1:k]
}

x <- c(0, 0, 0)

# Define the objective cost function
fn4 <- function(x)  # f(x, y, z) = 4y-2z + x^2+y^2
{
  4*x[2] - 2*x[3] + x[1]^2 + x[2]^2
  #sum(x^2)
}

# check the derivative of the objective function
grad(fn4, x)

## [1]  0  4 -2

# define the domain constraints of the problem
# constraint z1: 2x-y-z = 2
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){
  z1=2*x[1] - x[2] - x[3] - 2
  z2=x[1]^2 + x[2]^2 + x[3] - 1
  return(c(z1, z2))
}

# Check the Jacobian of the constraints
jacobian(eqn4, x)

##      [,1] [,2] [,3]
## [1,]    2   -1   -1
## [2,]    0    0    1

# Call the Lagrange-multipliers solver

# check one step of the algorithm
k <- length(x)
l <- length(eqn4(x));
h <- function(x) {
  c(grad(fn4, x[1:k]) - t(-x[(1:2)]) %*% jacobian(eqn4, x[1:k]),
    -eqn4(x[1:k]))
}

jacobian(h, x)

##      [,1] [,2]          [,3]
## [1,]    4    0  0.000000e+00
## [2,]   -1    2  5.482583e-15
## [3,]   -1    1  0.000000e+00
## [4,]   -2    1  1.000000e+00
## [5,]    0    0 -1.000000e+00

# Lagrange-multipliers solver for f(x, y, z) subject to g(x, y, z)
lagrange_multipliers(x, fn4, eqn4)

## [1]  0.3416408 -1.0652476 -0.2514708


Now, let’s double-check the above manual optimization results against the
automatic solnp solution minimizing

$$f(x, y, z) = 4y - 2z + x^2 + y^2$$

subject to:

$$\begin{cases} 2x - y - z = 2 \\ x^2 + y^2 + z = 1. \end{cases}$$

library(Rsolnp)

fn4 <- function(x)  # f(x, y, z) = 4y-2z + x^2+y^2
{
  4*x[2] - 2*x[3] + x[1]^2 + x[2]^2
}

# constraint z1: 2x-y-z = 2
# constraint z2: x^2+y^2 +z = 1
eqn4 <- function(x){
  z1=2*x[1] - x[2] - x[3]
  z2=x[1]^2 + x[2]^2 + x[3]
  return(c(z1, z2))
}

constraints4 <- c(2, 1)

x0 <- c(1, 1, 1)
ctrl <- list(trace=0)

sol4 <- solnp(x0, fun = fn4, eqfun = eqn4, eqB = constraints4, control=ctrl)
sol4$values

## [1]  4.0000000 -0.1146266 -5.9308852 -3.7035124 -2.5810141 -2.5069444
## [7] -2.5065779 -2.5065778 -2.5065778

The results of both (manual and automated) experiments identifying the optimal
$(x, y, z)$ coordinates minimizing the objective function $f(x, y, z) = 4y - 2z + x^2 + y^2$
are in agreement.

• Manual optimization: lagrange_multipliers(x, fn4, eqn4):
0.3416408 -1.0652476 -0.2514708.

• Automated optimization: solnp(x0, fun = fn4, eqfun = eqn4,
eqB = constraints4, control = ctrl): 0.3416408 -1.0652476
-0.2514709.
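A quick independent check, sketched here in base R, is to substitute the reported solution back into the constraints and the objective. The helper name g4 is hypothetical; the constraint and objective expressions are the ones stated in this section.

```r
# objective f(x, y, z) = 4y - 2z + x^2 + y^2
fn4 <- function(x) { 4*x[2] - 2*x[3] + x[1]^2 + x[2]^2 }

# constraint residuals: 2x - y - z - 2 and x^2 + y^2 + z - 1 (both should be ~0)
g4 <- function(x) { c(2*x[1] - x[2] - x[3] - 2, x[1]^2 + x[2]^2 + x[3] - 1) }

x_star <- c(0.3416408, -1.0652476, -0.2514708)   # solution reported above

g4(x_star)    # constraint residuals at the reported solution
fn4(x_star)   # objective value, to compare with the last entry of sol4$values
```

Residuals near zero (up to the seven printed digits of the solution) confirm that the point is feasible, and the objective value should match the final sol4$values entry.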


22.5  Data  Denoising 

Suppose we are given $x_{noisy}$ with $n$ noise-corrupted data points. The noise may be
additive ($x_{noisy} \sim x + \epsilon$) or not additive. We may be interested in denoising the signal
and recovering a version of the original (unobserved) dataset $x$, potentially as a
smoothed representation of the original (uncorrupted) process. Smoother signals
suggest fewer (random) fluctuations between neighboring data points.

One objective function we can design to denoise the observed signal, $x_{noisy}$, may
include a fidelity term and a regularization term; see the regularized linear modeling
in Chap. 18.




Total variation denoising assumes that for each time point $t$, the observed
noisy data

$$\underbrace{x_{noisy}(t)}_{\text{observed signal}} \sim \underbrace{x(t)}_{\text{native signal}} + \underbrace{\epsilon(t)}_{\text{random noise}}.$$

To recover the native signal, $x(t)$, we can optimize ($\arg\min_x f(x)$) the following
objective cost function:

$$f(x) = \underbrace{\frac{1}{2} \sum_{t=1}^{n-1} \| y(t) - x_{noisy}(t) \|^2}_{\text{fidelity term}} + \underbrace{\lambda \sum_{t=2}^{n-1} | x(t) - x(t-1) |}_{\text{regularization term}},$$

where $\lambda$ is the regularization smoothness parameter, $\lambda \to 0 \Rightarrow y \to x_{noisy}$.
Minimizing $f(x)$ provides a minimum total-variation solution to the data denoising
problem.
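The objective above can be evaluated directly for any candidate signal. The following base R sketch uses a hypothetical helper, tv_objective, that sums over all available points (rather than the exact index limits in the formula) and compares the raw data against a crude moving-average candidate; the simulated signal, noise level, and window width are assumptions for illustration only.

```r
# total-variation denoising objective for a candidate signal y (hypothetical helper)
tv_objective <- function(y, x_noisy, lambda) {
  fidelity <- 0.5 * sum((y - x_noisy)^2)   # closeness to the observed data
  regularization <- sum(abs(diff(y)))      # total variation of the candidate signal
  fidelity + lambda * regularization
}

set.seed(1234)
tt <- seq(0, 8, length.out = 200)
x_obs <- sin(tt) + rnorm(length(tt), 0, 0.3)   # assumed simulated noisy signal

# crude candidate: centered 5-point moving average, raw values kept at the edges
y_smooth <- as.numeric(stats::filter(x_obs, rep(1/5, 5), sides = 2))
y_smooth[is.na(y_smooth)] <- x_obs[is.na(y_smooth)]

tv_raw    <- tv_objective(x_obs, x_obs, lambda = 1)      # zero fidelity cost, large variation
tv_smooth <- tv_objective(y_smooth, x_obs, lambda = 1)   # some fidelity cost, smaller variation
c(raw = tv_raw, smoothed = tv_smooth)
```

Larger values of lambda weigh the variation penalty more heavily and therefore favor smoother candidates, while lambda = 0 reduces the objective to the fidelity term alone.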

Below  is  an  example  illustrating  total  variation  (TV)  denoising  using  a  simulated 
noisy  dataset.  We  start  by  generating  an  oscillatory  noisy  signal.  Then,  we  compute 
several  smoothed  versions  of  the  noisy  data,  plot  the  initial  and  smoothed  signals, 
define  and  optimize  the  TV  denoising  objective  function,  which  is  a  mixture  of  a 
fidelity  term  and  a  regularization  term  (Fig.  22.4). 


Fig. 22.4 Denoising by smoothing, raw noisy data and two smoothed models (loess)

n <- 1000
x <- rep(0, n)

xs <- seq(0, 8, len=n)  # seq(from = 1, to = 1, length)
noise_level = 0.3  # sigma of the noise, try varying this noise-level

# here is where we add the zero-mean noise
set.seed(1234)

x_noisy <- function (x) {
  sin(x)^2/(1.5+cos(x)) + rnorm(length(x), 0, noise_level)
}

# initialize the manual denoised signal
x_denoisedManu <- rep(0, n)

df <- as.data.frame(cbind(xs, x_noisy(xs)))

# loess fit a polynomial surface determined by numerical predictors,
# using local fitting
poly_model1 <- loess(x_noisy(xs) ~ xs, span=0.1, data=df)  # tight model
poly_model2 <- loess(x_noisy(xs) ~ xs, span=0.9, data=df)  # smoother model

# To see some of the numerical results of the model-fitting:
# View(as.data.frame(cbind(xs, x_noisy, predict(poly_model1))))

plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)

Next,  let’s  initiate  the  parameters,  define  the  objective  function  and  optimize  it, 
i.e.,  estimate  the  parameters  that  minimize  the  cost  function  as  a  mixture  of  fidelity 
and  regularization  terms  (Fig.  22.5). 


Fig. 22.5 Manual denoising signal recovery using non-linear constrained optimization (solnp)

# initialization of parameters
betas_0 <- c(0.3, 0.3, 0.5, 1)
betas <- betas_0

# Denoised model
x_denoised <- function(x, betas) {
  if (length(betas) != 4) {
    # stop() replaces break(), which is only valid inside a loop
    stop(paste0("Error!!! length(betas) = ", length(betas), " != 4!!! Exiting ..."))
  }
  # print(paste0(" .... betas = ", betas, "...\n"))
  # original noise function definition: sin(x)^2/(1.5+cos(x))
  return((betas[1]*sin(betas[2]*x)^2)/(betas[3]+cos(x)))
}

library(Rsolnp)

fidelity <- function(x, y) {
  sqrt((1/length(x)) * sum((y - x)^2))
}

regularizer <- function(betas) {
  reg <- 0
  # note: length(betas-1) equals length(betas), so all four betas are summed
  for (i in 1:(length(betas-1))) {
    reg <- reg + abs(betas[i])
  }
  return(reg)
}

# Objective Function
objective_func <- function(betas) {
  # f(x) = 1/2 * \sum_{t=1}^{n-1} |y(t) - x_{noisy}(t)|^2 +
  #        \lambda * \sum_{t=2}^{n-1} | x(t) - x(t-1) |
  fid <- fidelity(x_noisy(xs), x_denoised(xs, betas))
  reg <- abs(betas[4])*regularizer(betas)
  error <- fid + reg
  # uncomment to track the iterative optimization state
  # print(paste0(".... Fidelity =", fid, " ... Regularizer = ", reg,
  #              " ... TotalError=", error))
  # print(paste0("....betas=(", betas[1], ",", betas[2], ",", betas[3], ",",
  #              betas[4], ")"))
  return(error)
}

# inequality constraint forcing regularization parameter lambda=beta[4]>0
inequalConstr <- function(betas){
  betas[4]
}
inequalLowerBound <- 0; inequalUpperBound <- 100

# should we list the value of the objective function and the parameters at
# every iteration (default trace=1; trace=0 means no interim reports)
# constraint problem
# (ctrl must be defined before the solnp() call below)
ctrl <- list(trace=0, tol=1e-5)  ## could specify: outer.iter=5, inner.iter=9
set.seed(121)
sol_lambda <- solnp(betas_0, fun = objective_func, ineqfun = inequalConstr,
                    ineqLB = inequalLowerBound, ineqUB = inequalUpperBound,
                    control=ctrl)

# unconstraint optimization
# ctrl <- list(trace=1, tol=1e-5)  ## could specify: outer.iter=5, inner.iter=9
# sol_lambda <- solnp(betas_0, fun = denoising_func, control=ctrl)

# suppress the report of the functional values (too many)
# sol_lambda$values

# report the optimal parameter estimates (betas)
sol_lambda$pars

## [1] 2.5649689 0.9829681 1.7605481 0.9895268

# Reconstruct the manually-denoised signal using the optimal betas
betas <- sol_lambda$pars
x_denoisedManu <- x_denoised(xs, betas)
print(paste0("Final Denoised Model:", betas[1], "*sin(", betas[2],
             "*x)^2/(", betas[3], "+cos(x)))"))

## [1] "Final Denoised Model:2.56496893433154*sin(0.982968123322892*x)^2/(1.76054814253387+cos(x)))"

plot(x_denoisedManu)

Finally,  we  can  validate  our  manual  denoising  protocol  against  the  automated  TV 
denoising  using  the  R  package  tvd  (Fig.  22.6). 



Fig. 22.6 Plot of the observed noisy data and four alternative denoised reconstructions

# install.packages("tvd")
library("tvd")

lambda_0 <- 0.5
x_denoisedTVD <- tvd1d(x_noisy(xs), lambda_0, method = "Condat")
# lambda_0 is the total variation penalty coefficient
# method is a string indicating the algorithm to use for denoising.
# Default method is "Condat"

# plot(xs, x_denoisedTVD, type='l')
plot(xs, x_noisy(xs), type='l')
lines(xs, poly_model1$fitted, col="red", lwd=2)
lines(xs, poly_model2$fitted, col="blue", lwd=3)
lines(xs, x_denoisedManu, col="pink", lwd=4)
lines(xs, x_denoisedTVD, col="green", lwd=5)

# add a legend
legend("bottom", c("x_noisy", "poly_model1$fitted", "poly_model2$fitted",
       "x_denoisedManu", "x_denoisedTVD"),
       col=c("black", "red", "blue", "pink", "green"),
       lty=c(1,1,1,1), cex=0.7, lwd=c(1,2,3,4,5), title="TV Denoising")


22.6  Assignment:  22.  Function  Optimization 
22.6.1  Unconstrained  Optimization 

Apply optim() to solve the following unconstrained optimization problems:

1. $\min_x f(x) = x^4$.

2. $\max_x \left(2\sin x - \frac{x^2}{10}\right)$.

3. $\max_{x,y} \left(2xy + 2x - x^2 - 2y^2\right)$.


22.6.2  Linear  Programming  (LP) 


Solve the following LP problem:

$$\max_{x_1, x_2, x_3, x_4} \left(x_1 + 2x_2 + 3x_3 + 4x_4 + 5\right)$$

subject to:

$$\begin{cases} 4x_1 + 3x_2 + 2x_3 + x_4 \le 10 \\ x_1 - x_3 + 2x_4 = 2 \\ x_1 + x_2 + x_3 + x_4 \ge 1 \\ x_1 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0 \end{cases}$$

Apply lpSolveAPI and Rsolnp and compare the results.


22.6.3 Mixed Integer Linear Programming (MILP)

Apply lpSolveAPI to solve the following MILP problem:

$$\min_{x_1, x_2} \left(4x_1 + 6x_2\right)$$

subject to:

$$\begin{cases} 2x_1 + 2x_2 \ge 5 \\ x_1 - x_2 \le 1 \\ x_1, x_2 \ge 0 \\ x_1, x_2 \in \text{integers} \end{cases}$$


22.6.4 Quadratic Programming (QP)

Solve the following QP problem:

$$\min_{x_1, x_2} \left(2x_1^2 + x_2^2 + x_1 x_2 + x_1 + x_2\right)$$

subject to:

$$\begin{cases} x_1 + x_2 = 1 \\ x_1, x_2 \ge 0 \end{cases}$$


•  Apply  quadprog  to  solve  the  QP. 

•  Use  Rsolnp  to  solve  the  QP. 

•  Write  the  Lagrange  multiplier  form. 

•  Apply  numDeriv  to  solve  this  Lagrange  multiplier  optimization  manually. 

•  Compare  the  three  versions  of  the  results  above. 


22.6.5  Complex  Non-linear  Optimization 

Solve the following nonlinear problem:

$$\min_{x_1, x_2} \left( 100\left(x_2 - x_1^2\right)^2 + \left(1 - x_1\right)^2 \right)$$

subject to $x_1, x_2 \ge 0$.


22.6.6 Data Denoising

Based on the signal denoising example presented in this chapter, try to change the
noise level, replicate the denoising process, and report your findings.


Chapter 23

Deep Learning, Neural Networks

Deep learning is a special branch of machine learning using a collage of algorithms
to model high-level data motifs. Deep learning resembles the biological communications
of systems of brain neurons in the central nervous system (CNS), where
synthetic graphs represent the CNS network as nodes/states and connections/edges
between them. For instance, in a simple synthetic network consisting of a pair of
connected nodes, an output sent by one node is received by the other as an input
signal. When more nodes are present in the network, they may be arranged in
multiple levels (like a multiscale object) where the ith layer output serves as the
input of the next (i + 1)st layer. The signal is manipulated at each layer, sent as a
layer output downstream, interpreted as an input to the next, (i + 1)st layer, and so
forth. Deep learning relies on multiple layers of nodes and many edges linking the
nodes forming input/output (I/O) layered grids representing a multiscale processing
network. At each layer, linear and non-linear transformations convert inputs
into outputs.

In  this  chapter,  we  explore  the  R  deep  learning  package  MXNet  and  demonstrate 
state-of-the-art  deep  learning  models  utilizing  CPU  and  GPU  for  fast  training 
(learning)  and  testing  (validation).  Other  powerful  deep  learning  frameworks  include 
TensorFlow, Theano, Caffe, Torch, CNTK, and Keras.

Neural Networks vs. Deep Learning: Deep Learning is a machine learning
strategy that learns a deep multi-level hierarchical representation of the affinities
and motifs in the dataset. Machine learning Neural Nets tend to use shallower
network models. Although there are no formal restrictions on the depth of the layers
in a Neural Net, few layers are commonly utilized. Recent methodological, algorithmic,
computational, infrastructure, and service advances overcome previous
limitations. In addition, the rise of Big Data accelerated the evolution of classical
Neural Nets to Deep Neural Nets, which can now handle many layers and many
hidden nodes per layer. The former is a precursor to the latter; however, there are
also non-neural deep learning techniques, for example, syntactic pattern recognition
methods and grammar induction, both of which discover hierarchies.


© Ivo D. Dinov 2018


23.1  Deep  Learning  Training 

Review  Chap.  11  (Black  Box  Machine-Learning  Methods:  Neural  Networks  and 
Support  Vector  Machines)  prior  to  proceeding. 


23.1.1  Perceptrons 


A perceptron is an artificial analogue of a neuronal brain cell that calculates a
weighted sum of the input values and outputs a thresholded version of that result. For
a bivariate perceptron, P, having two inputs, (X, Y), we can denote the weights of the
inputs by A and B, respectively. Then, the weighted sum could be represented as:


W  =  AX  +  BY. 


At each layer $l$, the weight matrix, $W^{(l)}$, has the following properties:

• The number of rows of $W^{(l)}$ equals the number of nodes/units in the previous
$(l-1)$st layer, and

• The number of columns of $W^{(l)}$ equals the number of units in the next $(l+1)$st
layer.

Neuronal  cells  fire  depending  on  the  presynaptic  inputs  to  the  cell,  which  causes 
constant  fluctuations  of  the  neuronal  membrane  -  depolarizing  or  hyperpolarizing, 
i.e.,  the  cell  membrane  potential  rises  or  falls.  Similarly,  perceptrons  rely  on 
thresholding  of  the  weight-averaged  input  signal,  which  for  biological  cells  corre¬ 
sponds  to  voltage  increases  passing  a  critical  threshold.  Perceptrons  output  non-zero 
values only when the weighted sum exceeds a certain threshold C. In terms of its
input vector, (X, Y), we can describe the output of each perceptron (P) by:

$$Output(P) = \begin{cases} 1, & \text{if } AX + BY > C \\ 0, & \text{if } AX + BY \le C \end{cases}.$$
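The thresholding rule can be sketched in a few lines of base R. The function name `perceptron` and the particular weights and threshold used in the calls are illustrative assumptions, not values taken from the text.

```r
# Bivariate perceptron: output 1 when the weighted sum A*X + B*Y exceeds C
perceptron <- function(X, Y, A, B, C) {
  W <- A * X + B * Y     # weighted sum of the two inputs
  as.integer(W > C)      # thresholded output: 1 if W > C, otherwise 0
}

# Hypothetical weights A = B = 0.6 and threshold C = 1
perceptron(X = 1, Y = 1, A = 0.6, B = 0.6, C = 1)  # weighted sum 1.2 exceeds C
perceptron(X = 1, Y = 0, A = 0.6, B = 0.6, C = 1)  # weighted sum 0.6 does not
```

With these assumed values the unit fires only when both inputs are 1, so it behaves like a logical AND.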


Feed-forward  networks  are  constructed  as  layers  of  perceptrons  where  the  first 
layer  ingests  the  inputs  and  the  last  layer  generates  the  network  outputs.  The 
intermediate  (internal)  layers  are  not  directly  connected  to  the  external  world,  and 
are  called  hidden  layers.  In  fully  connected  networks ,  each  perceptron  in  one  layer  is 
connected  to  every  perceptron  on  the  next  layer  enabling  information  “fed  forward” 
from  one  layer  to  the  next.  There  are  no  connections  between  perceptrons  in  the  same 
layer. 

Multilayer perceptrons (fully-connected feed-forward neural networks) consist of
several fully-connected layers representing an input matrix $X_{n \times m}$ and a generated
output matrix $Y_{n \times k}$. The input $X_{n \times m}$ is a matrix encoding the $n$ cases and
$m$ features per case. The weight matrix $W_{m \times k}$ for layer $l$ has rows ($i$) corresponding to
the weights leading from all the units $i$ in the previous layer to all of the units $j$ in the
current layer. The product matrix $X \times W$ has dimensions $n \times k$.

Fig. 23.1 Graphical representation of three alternative activation functions

The hidden size parameter $k$, the weight matrix $W_{m \times k}$, and the bias vector $b_{k \times 1}$
are used to compute the outputs at each layer:

$$Y_{n \times k}^{(l)} = f\left( X_{n \times m} W_{m \times k} + b_{k \times 1} \right).$$


The role of the bias parameter is similar to the intercept term in linear regression
and helps improve the accuracy of prediction by shifting the decision boundary
along the Y axis. The outputs of the fully-connected layers feed into an activation
layer that performs element-wise operations. Examples of activation functions that
transform real numbers to probability-like values include (Fig. 23.1):

•  The  sigmoid  function,  a  special  case  of  the  logistic  function,  which  converts  real 
numbers  to  probabilities, 

•  The  rectifier  (relu,  Rectified  Linear  Unit)  function,  which  outputs  the  max(0, 
input), 

•  The  tanh  (hyperbolic  tangent  function). 
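The three activation functions listed above can be written directly in base R. The helper names `sigmoid` and `relu` are chosen here for illustration; `tanh()` is already built into R.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))   # maps real numbers into (0, 1)
relu    <- function(x) pmax(0, x)          # max(0, input), element-wise
# tanh() is part of base R and maps real numbers into (-1, 1)

z <- c(-2, 0, 2)
sigmoid(z)
relu(z)
tanh(z)
```

All three operate element-wise, so they can be applied unchanged to vectors or matrices of layer outputs.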

The final fully-connected layer may have a hidden size equal to the number of
classes in the dataset and may be followed by a softmax layer mapping the input
into a probability score. For example, if a size $n \times m$ input is denoted by $X_{n \times m}$, then
the probability scores may be obtained by the softmax transformation function,
which maps real valued vectors to vectors of probabilities:

$$\left( \frac{e^{x_{i,1}}}{\sum_{j=1}^{m} e^{x_{i,j}}}, \ldots, \frac{e^{x_{i,m}}}{\sum_{j=1}^{m} e^{x_{i,j}}} \right), \quad i = 1, \ldots, n.$$
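A minimal base R sketch of the softmax transformation follows. The function name `softmax` and the example scores are assumptions for illustration; subtracting the maximum before exponentiating is a common numerical safeguard that leaves the result unchanged.

```r
# Softmax: exponentiate each score and normalize so the results sum to 1
softmax <- function(z) {
  e <- exp(z - max(z))   # shifting by max(z) avoids overflow
  e / sum(e)
}

scores <- c(2, 1, 0.1)          # hypothetical real-valued scores for 3 classes
p <- softmax(scores)
p
sum(p)

# Row-wise application to an n x m matrix of scores
X <- rbind(c(2, 1, 0.1), c(0, 0, 0))
P <- t(apply(X, 1, softmax))
rowSums(P)
```

Each row of `P` is a probability vector, and the largest score in a row always receives the largest probability.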
Figure 23.2 is a schematic of a fully-connected feed-forward neural network
of nodes:

$$\left\{ a_{j=\text{node index},\ l=\text{layer index}} \right\}_{j=1,\ l=1}^{n_j,\ 4}.$$


The plot above illustrates the key elements in the action potential, or activation
function, and the calculations of the corresponding training parameters:

$$a_{node=k,\ layer=l} = f\left( \sum_{i} w_{k,i}^{l} \times a_{i}^{l-1} + b_{k}^{l} \right),$$

where:

• $f$ is the activation function, e.g., the logistic function $f(x) = \frac{1}{1+e^{-x}}$; it converts the
aggregate weights at each node to probability values,

• $w_{k,i}^{l}$ is the weight carried from the $i$th element of the $(l-1)$th layer to the $k$th
element of the current $l$th layer,

• $b_{k}^{l}$ is the (residual) bias present in the $k$th element in the $l$th layer. This is
effectively the information not explained by the training model.

These parameters may be estimated using different techniques (e.g., using least
squares, or stochastically using steepest descent methods) based on the training data.
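The node-level calculation described above (a weighted sum of the previous layer's activations plus a bias, passed through the activation function) can be sketched as a single matrix operation in base R. The function name `layer_forward` and all numeric values are hypothetical.

```r
logistic <- function(x) 1 / (1 + exp(-x))

# Activations of layer l computed from the activations of layer l-1:
# a_l[k] = f( sum_i W[k, i] * a_prev[i] + b[k] )
layer_forward <- function(a_prev, W, b, f = logistic) {
  as.vector(f(W %*% a_prev + b))
}

a_prev <- c(1, 0, 1)                       # outputs of 3 nodes in layer l-1
W <- matrix(c( 0.5, -0.5, 0.5,
              -1.0,  1.0, 1.0), nrow = 2, byrow = TRUE)  # 2 nodes in layer l
b <- c(0, -1)                              # one bias per node in layer l

layer_forward(a_prev, W, b)                # logistic(1) and logistic(-1)
```

Here row `k` of `W` holds the weights feeding node `k` of the current layer, which matches the indexing of the weights in the formula.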


23.2  Biological  Relevance 

There are parallels between biology (neuronal cells) and the mathematical models
(perceptrons) for neural network representation. The human brain contains about $10^{11}$
neuronal cells connected by approximately $10^{15}$ synapses forming the basis of our
functional phenotypes. Figure 23.3 illustrates some of the parallels between brain
biology and the mathematical representation using synthetic neural nets. Every neuronal
cell receives multi-channel (afferent) input from its dendrites, generates output
signals, and disseminates the results via its (efferent) axonal connections and synaptic
connections to dendrites of other neurons.

The  perceptron  is  a  mathematical  model  of  a  neuronal  cell  that  allows  us  to 
explicitly  determine  algorithmic  and  computational  protocols  transforming  input 
signals into output actions. For instance, a signal arriving through an axon $x_0$ is
modulated by some prior weight, e.g., synaptic strength, $w_0 \times x_0$. Internally, within
the  neuronal  cell,  this  input  is  aggregated  (summed,  or  weight-averaged)  with  inputs 
from  all  other  axons.  Brain  plasticity  suggests  that  synaptic  strengths 
(weight  coefficients  w)  are  strengthened  by  training  and  prior  experience.  This 
learning  process  controls  the  direction  and  strength  of  influence  of  neurons  on 
other  neurons.  Either  excitatory  (w  >  0)  or  inhibitory  (w  <  0)  influences  are  possible. 
Dendrites  and  axons  carry  signals  to  and  from  neurons,  where  the  aggregate 
responses  are  computed  and  transmitted  downstream.  Neuronal  cells  only  fire  if 
action  potentials  exceed  a  certain  threshold.  In  this  situation,  a  signal  is  transmitted 
downstream  through  its  axons.  The  neuron  remains  silent,  if  the  summed  signal  is 
below  the  critical  threshold. 

Timing  of  events  is  important  in  biological  networks.  In  the  computational 
perceptron  model,  a  first  order  approximation  may  ignore  the  timing  of  neuronal 
firing  (spike  events)  and  only  focus  on  the  frequency  of  the  firing.  The  firing  rate  of  a 
neuron with an activation function $f$ represents the frequency of the spikes along the
axon. We saw some examples of activation functions earlier.

Figure 23.3 illustrates the parallels between the brain network-synaptic organization
and an artificial synthetic neural network.

Fig. 23.3 A depiction of the parallels between a biological central nervous system network
organization (human brain) and a synthetic neural network employed in deep machine learning


23.3 Simple Neural Net Examples

Before we look at examples of deep learning algorithms applied to model
observed natural phenomena, we will develop a couple of simple
networks for computing fundamental Boolean operations.


23.3.1  Exclusive  OR  (XOR)  Operator 

The  exclusive  OR  (XOR)  operator  works  as  a  bivariate  binary-outcome  function, 
mapping  pairs  of  false  (0)  and  true  (1)  values  to  dichotomous  false  (0)  and  true 
(1)  outcomes. 

We  can  design  a  simple  two-layer  neural  network  that  calculates  XOR.  The  values 
listed  within  each  neuron  represent  its  explicit  threshold,  which  can  be  normal¬ 
ized  so  that  all  neurons  utilize  the  same  threshold  (typically  1).  The  value  labels 
associated  with  network  connections  (edges)  represent  the  weights  of  the  inputs. 

When  the  threshold  is  not  reached,  the  output  is  0,  and  when  the  threshold  is  reached, 
the  output  is  correspondingly  1  (Fig.  23.4). 

Let’s  work  out  manually  the  four  possibilities  (Table  23.1): 

We  can  validate  that  this  network  indeed  represents  an  XOR  operator  by  plugging 
in  all  four  possible  input  combinations  and  confirming  the  expected  results  at  the  end 
(Fig.  23.5). 
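One way to check such a network numerically is sketched below in base R. The weights are assumptions chosen so that every unit uses the common threshold 1; they are not necessarily the values shown in Fig. 23.4. The hidden layer computes OR and AND, and the output unit fires when OR is on and AND is off.

```r
# A threshold unit outputs 1 when the weighted input sum reaches the threshold
step_unit <- function(inputs, weights, threshold = 1) {
  as.integer(sum(inputs * weights) >= threshold)
}

xor_net <- function(X, Y) {
  h_or  <- step_unit(c(X, Y), c(1, 1))       # fires if X or Y is 1
  h_and <- step_unit(c(X, Y), c(0.5, 0.5))   # fires only if both are 1
  step_unit(c(h_or, h_and), c(1, -1))        # OR and not AND
}

inputs <- expand.grid(X = 0:1, Y = 0:1)      # all four input combinations
inputs$Z <- mapply(xor_net, inputs$X, inputs$Y)
inputs
```

The resulting column `Z` equals 1 exactly when the two inputs differ, which is the XOR truth table.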


Fig. 23.4 A neural network representation corresponding to the XOR binary function

Table 23.1 Exact XOR binary operator

InputX    InputY    XOR output(Z)
0         0         0
0         1         1
1         0         1
1         1         0

Fig. 23.5 Validation of the explicit neural network calculation of the XOR operator


23.3.2  NAND  Operator 


Another  binary  operator  is  NAND  (negative  AND,  Sheffer  stroke),  which  produces  a 
false  (0)  output  if  and  only  if  both  of  its  operands  are  true  (1),  and  which  generates 
true  (1),  otherwise.  Below  is  the  NAND  input-output  table  (Table  23.2). 

Similarly to the XOR operator, we can design a one-layer neural network that
calculates NAND. Again, the values listed within each neuron represent its explicit
threshold, which can be normalized so that all neurons utilize the same threshold
(typically 1). The value labels associated with network connections (edges)
represent the weights of the inputs. When the threshold is not reached, the output
is trivial (0), and when the threshold is reached, the output is correspondingly 1. Here
is a shorthand analytic expression for the NAND calculation:

NAND(X, Y) = 1.3 - (1 × X + 1 × Y).

Check that NAND(X, Y) = 0 if and only if X = 1 and Y = 1, otherwise it
equals 1 (Fig. 23.6).

Table 23.2 Exact NAND binary operator

InputX    InputY    NAND output(Z)
0         0         1
0         1         1
1         0         1
1         1         0

Fig. 23.6 A neural network representation corresponding to the NAND binary function
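The shorthand NAND expression can be checked numerically with a short base R sketch. The firing rule used here (output 1 when the expression is positive, 0 otherwise) is an assumption made for this illustration, since the raw expression takes the values 1.3, 0.3, and -0.7 rather than 0 and 1.

```r
nand <- function(X, Y) {
  s <- 1.3 - (1 * X + 1 * Y)   # shorthand expression from the text
  as.integer(s > 0)            # assumed firing rule: 1 when s is positive
}

truth <- expand.grid(X = 0:1, Y = 0:1)   # all four input combinations
truth$Z <- nand(truth$X, truth$Y)
truth
```

Only the input pair X = 1, Y = 1 drives the expression below zero, so it is the only combination that yields 0.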


23.3.3  Complex  Networks  Designed  Using  Simple  Building 
Blocks 

Observe that stringing together some of these primitive networks of Boolean operators,
and/or increasing the number of hidden layers, allows us to model problems
with exponentially increasing complexity. For instance, constructing a 4-input NAND
function would simply require repeating several of our 2-input NAND operators. This
will increase the space of possible outcomes from $2^2$ to $2^4$. Of course, introducing
more depth in the hidden layers further expands the complexity of the problems that
can be modeled using neural nets.
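As a sketch of this building-block idea, a 4-input NAND can be composed purely from 2-input NAND gates in base R. The helper names `nand2`, `and2`, and `nand4` are hypothetical, and the gates are written as logical functions rather than as weighted threshold units.

```r
nand2 <- function(x, y) as.integer(!(x == 1 && y == 1))    # 2-input NAND
and2  <- function(x, y) nand2(nand2(x, y), nand2(x, y))    # AND = NOT(NAND)
nand4 <- function(a, b, c, d) nand2(and2(a, b), and2(c, d))

grid <- expand.grid(a = 0:1, b = 0:1, c = 0:1, d = 0:1)    # 2^4 = 16 input rows
grid$out <- mapply(nand4, grid$a, grid$b, grid$c, grid$d)
nrow(grid)
sum(grid$out == 0)   # number of input rows for which the output is 0
```

The composite gate returns 0 only when all four inputs equal 1, and it uses seven 2-input NAND evaluations per input row.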
Fig. 23.7 Live Demo: TensorFlow and ConvnetJS deep neural network webapps


You can interactively manipulate Google's TensorFlow Deep Neural Network
Webapp to gain additional intuition and experience with the various components of
deep learning networks.

The ConvnetJS demo provides another hands-on example using 2D classification
with a 2-layer neural network (Fig. 23.7).


23.4  Classification 

In MXNet, a multilayer perceptron (MLP) may be defined by:

• Creating a placeholder variable for the input data, data = mx.sym.Variable('data')

• Flattening the data from 4D shape space (width, height, batch_size,
num_channel) into 2D (num_channel*width*height, batch_size),
data = mx.sym.Flatten(data=data)

• And iterating over the fully-connected layers:

  - First layer, fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)

  - Apply relu function to the output of the first fully-connected layer,
    act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu")

  - Generate the second fully-connected layer and apply the activation function,
    fc2 = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden=64);
    act2 = mx.sym.Activation(data=fc2, name='relu2', act_type="relu")

  - Generate the third/final fully-connected layer, with a hidden size k = 10, which
    in digit recognition tasks corresponds to the number of unique digits 0:9,
    fc3 = mx.sym.FullyConnected(data=act2, name='fc3', num_hidden=10)

• Finally, mapping the input into a probability score using the softmax and loss
layer, mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax'). See the mlp R source code here.
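To make the flow of dimensions through these layers concrete, the following base R sketch mimics the 128-64-10 architecture with random, untrained weights. It does not use MXNet; the helper names, the batch of 5 cases, and the 784 input features are all assumptions made for illustration.

```r
set.seed(42)
relu <- function(x) pmax(x, 0)
softmax_rows <- function(Z) {
  E <- exp(Z - apply(Z, 1, max))   # subtract each row's maximum for stability
  E / rowSums(E)
}

# Fully-connected layer with random (untrained) weights and zero biases
dense <- function(X, n_out) {
  W <- matrix(rnorm(ncol(X) * n_out, sd = 0.1), ncol(X), n_out)  # m x k weights
  b <- rep(0, n_out)                                             # k biases
  sweep(X %*% W, 2, b, "+")                                      # n x k output
}

X  <- matrix(runif(5 * 784), nrow = 5)   # 5 hypothetical cases, 784 features
a1 <- relu(dense(X, 128))                # first hidden layer
a2 <- relu(dense(a1, 64))                # second hidden layer
p  <- softmax_rows(dense(a2, 10))        # class probabilities for 10 classes

dim(a1); dim(a2); dim(p)
rowSums(p)
```

Each layer maps an n x m input to an n x k output, and the final softmax turns the 10 scores per case into probabilities that sum to 1.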


23.4.1  Sonar  Data  Example 

Let’s load the mlbench and mxnet packages and demonstrate the basic invocation
of mxnet. The Sonar data mlbench::Sonar includes sonar signals
bouncing off a metal cylinder or a roughly cylindrical rock. Each of 208 patterns
includes a set of 60 numbers (features) in the range 0.0-1.0, and a label M (metal) or
R (rock). Each feature represents the energy within a particular frequency band,
integrated over a certain period of time. The M and R labels associated with each
observation classify the record as a mine (metal cylinder) or a rock. The numbers in the
labels are in increasing order of aspect angle, but they do not encode the angle
directly.


# Load the required packages: mlbench and mxnet
# install.packages("mlbench"); install.packages("mxnet")
# Note mxnet requires "visNetwork"
# If it doesn't work, you may need the following lines:
# install.packages("drat", repos="https://cran.rstudio.com")
# drat:::addRepo("dmlc")
# install.packages("mxnet")

require(mlbench)
require(mxnet)

## Init Rcpp

data(Sonar, package="mlbench")
table(Sonar[,61])

##
##   M   R
## 111  97

Sonar[,61] = as.numeric(Sonar[,61])-1  # R = "1", "M" = "0"
set.seed(123)
train.ind = sample(1:nrow(Sonar), 0.7*nrow(Sonar))

train.x = data.matrix(Sonar[train.ind, 1:60])
train.y = Sonar[train.ind, 61]
test.x = data.matrix(Sonar[-train.ind, 1:60])
test.y = Sonar[-train.ind, 61]
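The label recoding and the random train/test split can be illustrated on a small toy factor, without loading the Sonar data. The ten labels below are made up for this sketch, and `floor()` is used to make the training-set size an explicit whole number.

```r
set.seed(123)
labels <- factor(c("M", "R", "R", "M", "M", "R", "M", "R", "M", "R"))

# Factor levels are sorted alphabetically, so "M" -> 0 and "R" -> 1
y <- as.numeric(labels) - 1
table(labels, y)

# Randomly assign 70% of the row indices to the training set
train.ind <- sample(seq_along(y), floor(0.7 * length(y)))
train.y <- y[train.ind]
test.y  <- y[-train.ind]
length(train.y); length(test.y)
```

Negative indexing with `-train.ind` selects exactly the rows that were not sampled, so the two sets never overlap.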

Let’s start by using a multi-layer perceptron as a classifier. The mxnet function
mx.mlp builds a general multi-layer neural network that can be utilized to do
classification or regression graph modeling. It relies on the following parameters:

• Training data and labels

• Number of hidden nodes in each hidden layer

• Number of nodes in the output layer

• Type of activation

• Type of output loss

• The device to train (GPU or CPU)

• Additional optional parameters, see mx.model.FeedForward.create

Here is one example using the training and testing data we defined above:

mx.set.seed(1234)
model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2,
    out_activation="softmax",
    num.round=200, array.batch.size=15, learning.rate=0.1, momentum=0.9,
    eval.metric=mx.metric.accuracy, verbose=F)

# calculate Prediction sensitivity & specificity
preds = predict(model.mx, test.x)  # these are probabilities

# You can inspect the test labels vs. assigned probabilities by
# View(data.frame(test.y, preds[2,]))

preds1 <- ifelse(preds[2,] <= 0.5, 0, 1)  # dichotomize to labels
table(preds1)

## preds1
##  0  1
## 35 28

pred.label = t(preds1)
table(pred.label, test.y)

##           test.y
## pred.label  0  1
##          0 28  7
##          1  6 22

library("caret")
sensitivity(factor(preds1), factor(as.numeric(test.y)), positive = 1)

## [1] 0.7586207

specificity(factor(preds1), factor(as.numeric(test.y)), negative = 0)

## [1] 0.8235294

We can also use crossval::diagnosticErrors() and crossval::confusionMatrix()
to get more detailed evaluations. Similar to using the
sensitivity() and specificity() methods, we should specify the negative
and positive labels.

Note that you have to specify crossval::confusionMatrix() if you also
have the caret package loaded, as caret also has a function called
confusionMatrix().

library("crossval")
diagnosticErrors(crossval::confusionMatrix(preds1, test.y, negative = 0))

##       acc      sens      spec       ppv       npv       lor
## 0.7936508 0.7857143 0.8000000 0.7586207 0.8235294 2.6855773
## attr(,"negative")
## [1] 0
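The caret values reported above can be reproduced by hand from the 2 x 2 table of predicted versus true labels. The sketch below re-enters those four counts and treats class 1 as the positive class, which is an assumption chosen to match the `positive = 1` argument used earlier.

```r
# Confusion table copied from the output above:
# rows = predicted label, columns = true label
cm <- matrix(c(28, 6, 7, 22), nrow = 2,
             dimnames = list(pred = c("0", "1"), truth = c("0", "1")))

TP <- cm["1", "1"]; TN <- cm["0", "0"]   # correct predictions
FP <- cm["1", "0"]; FN <- cm["0", "1"]   # incorrect predictions

sens <- TP / (TP + FN)        # 22 / 29: true positives among actual positives
spec <- TN / (TN + FP)        # 28 / 34: true negatives among actual negatives
acc  <- (TP + TN) / sum(cm)   # 50 / 63: overall fraction correct

round(c(sens = sens, spec = spec, acc = acc), 7)
```

These ratios agree with the sensitivity (0.7586207) and specificity (0.8235294) returned by caret, and with the accuracy (0.7936508) reported by crossval.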

Now, we compare the results for different numbers of rounds, or epochs,
representing the number of full (training-phase) passes through the data
(cf. num.round=n).

mx.set.seed(1234)
get_pred = function(n){
  model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2,
      out_activation="softmax",
      num.round=n, array.batch.size=15, learning.rate=0.1, momentum=0.9,
      eval.metric=mx.metric.accuracy, verbose=F)
  preds = predict(model.mx, test.x)
}

preds100 = get_pred(100)
preds50 = get_pred(50)
preds10 = get_pred(10)

We  can  plot  the  ROC  curve  and  calculate  the  AUC  (Area  under  the  curve) 
(Fig.  23.8): 

# install.packages("pROC"); install.packages("plotROC"); install.packages("reshape2")
library(pROC); library(plotROC); library(reshape2);

# compute AUC
get_roc = function(preds){
  roc_obj <- roc(test.y, preds[2,])
  auc(roc_obj)
}

get_roc(preds)

## Area under the curve: 0.9249

get_roc(preds100)

## Area under the curve: 0.9209

get_roc(preds50)

## Area under the curve: 0.8824

get_roc(preds10)

## Area under the curve: 0.8022

# plot roc
dt <- data.frame(test.y, preds[2,], preds100[2,], preds50[2,], preds10[2,])
colnames(dt) <- c("D", "rounds=200", "rounds=100", "rounds=50", "rounds=10")
dt <- melt(dt, id.vars = "D")

basicplot <- ggplot(dt, aes(d = D, m = value, colour=variable)) +
  geom_roc(labels = FALSE, size = 0.5, alpha.line = 0.6, linejoin = "mitre") +
  theme_bw() + coord_fixed(ratio = 1) + style_roc() + ggtitle("ROC CURVE") +
  annotate("rect", xmin = 0.4, xmax = 1, ymin = 0.2, ymax = 0.75,
           alpha = .2) +
  annotate("text", x = 0.7, y = 0.5, size = 3,
           label = "AUC:\n rounds=200: 0.9209\n rounds=100: 0.9128\n rounds=50: 0.8824\n rounds=10: 0.8022\n ")
basicplot
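For intuition about what the AUC measures, it can also be computed without pROC using the rank-sum (Mann-Whitney) identity: the AUC is the proportion of (positive, negative) pairs in which the positive case receives the higher score. The function name `auc_rank` and the six toy labels and scores below are assumptions for illustration only.

```r
# AUC via the rank-sum identity; labels are 0/1, scores are predicted probabilities
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # average ranks are used for tied scores
  n1 <- sum(labels == 1)        # number of positive cases
  n0 <- sum(labels == 0)        # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
auc_rank(labels, scores)   # 6 of the 9 positive-negative pairs are ordered correctly
```

A value of 1 corresponds to perfect separation of the two classes, and 0.5 corresponds to scores that carry no ranking information.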
Fig. 23.8 ROC curves of multi-layer perceptron predictions (mx.mlp), using out-of-bag test-data,
corresponding to different numbers of iterations, see Chap. 14

The plot suggests that the results stabilize after 100 training (epoch) iterations.
Let’s look at some visualizations of the real labels of the test data (test.y) and
their corresponding ML-derived classification labels (preds[2, ]) using 200 iterations
(Figs. 23.9, 23.10, 23.11, 23.12, and 23.13).

graph . viz ( mode L . mx$symbo L ) 
hist(predsl0[2j ] jmain  =  "rounds=10") 

hist(preds50[2j ] jmain  =  "rounds=50") 

hist (predsl00[2j ] jmain  =  "rounds=100" ) 

hist(preds[2j ] jmain  =  "rounds=200" ) 


Fig.  23.9  MLP  model  structure  (the  plot  is  rotated  90-degrees  to  save  space) 
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Fig.  23.10  Frequency  plot  of  the  predicted  probabilities  using  ten  epochs  corresponding  to  ten  full 
(training-phase)  passes  through  the  data  (cf.  num.round=n) 
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Fig.  23.11  Frequency  plot  of  the  predicted  probabilities  using  50  epochs,  compare  to  Fig.  23.10 
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Fig.  23.12  Frequency  plot  of  the  predicted  probabilities  using  100  epochs,  compare  to  Fig.  23.11 


We  see  a  significant  bimodal  trend  when  the  number  of  rounds  increases.  Another 
plot  shows  more  details  about  the  agreement  between  the  real  labels  and  their 
predicted  class  counterparts  (Fig.  23.14): 


Fig. 23.13 And finally, the plot of the predicted probabilities using 200 epochs; compare to Fig. 23.12
Fig. 23.14 Summary plots illustrating the progression of the neural network learning from 10 to 200 epochs, corresponding with improved binary classification results (testing data)
library(ggplot2)
get_gghist = function(preds) {
  ggplot(data.frame(test.y, preds),
         aes(x=preds, group=test.y, fill=as.factor(test.y))) +
    geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
}

df = data.frame(preds[2, ], preds100[2, ], preds50[2, ], preds10[2, ])
p <- lapply(df, get_gghist)

require(gridExtra)  # used to arrange ggplots

grid.arrange(p$preds10.2..., p$preds50.2..., p$preds100.2..., p$preds.2...)


23.4.2  MXNet  Notes 

• The mx.mlp() function is a proxy to the more complex and laborious process of defining a neural network by using MXNet's Symbol. For instance, this call

  model.mx <- mx.mlp(train.x, train.y, hidden_node=8, out_node=2, out_activation="softmax", num.round=20, array.batch.size=15, learning.rate=0.1, momentum=0.9, eval.metric=mx.metric.accuracy)

  is analogous to an explicit symbolic network definition like the following (shown here with two hidden layers and different hyperparameters):

  data <- mx.symbol.Variable("data")
  fc1 <- mx.symbol.FullyConnected(data, num_hidden=128)
  act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
  fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
  act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
  fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2)
  lro <- mx.symbol.SoftmaxOutput(fc3, name="sm")
  model2 <- mx.model.FeedForward.create(lro, X=train.x, y=train.y, ctx=mx.cpu(), num.round=100, array.batch.size=15, learning.rate=0.07, momentum=0.9)

  (see example with linear regression below).

• Layer-by-layer definitions translate inputs into outputs. At each level, the network allows for a different number of neurons and alternative activation functions. Other options can be specified by using mx.symbol:

  • mx.symbol.Convolution applies convolution to the input and then adds a bias. It can create convolutional neural networks.

  • mx.symbol.Deconvolution does the opposite and can be used in segmentation networks along with mx.symbol.UpSampling, e.g., to reconstruct the pixel-wise classification of an image.

  • mx.symbol.Pooling reduces the data by selecting signals with the highest response.

  • mx.symbol.Flatten links convolutional and pooling layers to form a fully connected network.

  • mx.symbol.Dropout attempts to cope with the overfitting problem.

The function mx.mlp() is a wrapper for quick design of standard multi-layer perceptrons. For more extensive experiments, customized symbolic representation can be explicitly specified using combinations of the above methods.
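To make the pooling step concrete, here is a minimal base-R sketch of 2 x 2 max pooling with stride 2. It does not use mxnet; the helper name max.pool is hypothetical and only illustrates the idea of keeping the highest response in each block.

```r
# Hypothetical helper (not an mxnet function): non-overlapping k x k max pooling
max.pool <- function(m, k = 2) {
  rows <- seq(1, nrow(m) - k + 1, by = k)   # top row of each block
  cols <- seq(1, ncol(m) - k + 1, by = k)   # left column of each block
  out <- matrix(NA_real_, length(rows), length(cols))
  for (i in seq_along(rows)) {
    for (j in seq_along(cols)) {
      block <- m[rows[i]:(rows[i] + k - 1), cols[j]:(cols[j] + k - 1)]
      out[i, j] <- max(block)               # keep the strongest signal in the block
    }
  }
  out
}

m <- matrix(1:16, nrow = 4)   # a toy 4 x 4 "image", filled column by column
pooled <- max.pool(m)
pooled                        # a 2 x 2 matrix: each entry is a block maximum
```

The pooled matrix has a quarter of the original entries, which is how pooling layers reduce the amount of data passed to later layers.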

# Example of a LeNet network for recognizing handwritten digits:
data <- mx.symbol.Variable('data')

# first convolution block: convolution, tanh activation, max pooling
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))

# second convolution block
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))

# flatten, then two fully connected layers and a softmax output
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
lenet <- mx.symbol.SoftmaxOutput(data=fc2)

device.cpu <- mx.cpu()  # computing context used for training
model <- mx.model.FeedForward.create(lenet, X=train.x, y=train.y, ctx=device.cpu, num.round=5, array.batch.size=100, learning.rate=0.05, momentum=0.9)

To allow smooth, fast, and consistent operation on CPU and GPU, in mxnet the generic R function controlling the reproducibility of stochastic results is overwritten by mx.set.seed. So we can use mx.set.seed() to control random numbers in MXNet.

To examine the accuracy of the model.mx learner (trained on the training data), we can make predictions (on the testing data) and evaluate the results using the provided testing labels (report the confusion matrix).

preds = predict(model.mx, test.x)
pred.label = max.col(t(preds))-1
table(pred.label, test.y)

##           test.y
## pred.label  0  1
##          0 28  7
##          1  6 22

For multi-class predictions, mxnet outputs an n (classes) x m (examples) matrix of predicted probabilities, where each row contains the probabilities of one class and each column corresponds to one example.
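The conversion from such a probability matrix to class labels, as done with max.col(t(preds))-1 above, can be sketched in base R. The matrix prob below is made up for illustration (3 classes, 4 examples); it is not output from any model in this chapter.

```r
# Hypothetical 3 (classes) x 4 (examples) probability matrix; columns sum to 1
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.2, 0.3, 0.5,
                 0.6, 0.3, 0.1), nrow = 3)

# t(prob) has one row per example; max.col() returns the column (class) index of
# the largest probability in each row; subtracting 1 gives 0-based class labels
pred.label <- max.col(t(prob)) - 1
pred.label
```

When two classes tie exactly, max.col() breaks the tie at random by default; passing ties.method = "first" makes the choice deterministic.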


23.5  Case-Studies 

Let’s  demonstrate  regression  and  prediction  deep  learning  examples  using  several 
complementary  datasets. 
23.5.1 ALS Regression Example

Let's first demonstrate a deep learning regression using the ALS data to predict ALSFRS_slope (Figs. 23.15 and 23.16).

als <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
als <- scale(als[, -c(1,7)])
train.ind = sample(1:nrow(als), 0.7*nrow(als))
train.x = data.matrix(als[train.ind, -c(1,7)])
train.y = als[train.ind, 7]

test.x = data.matrix(scale(als[-train.ind, -c(1,7)]))
test.y = als[-train.ind, 7]

# Define the input data
data <- mx.symbol.Variable("data")

# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this hidden layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)

# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)

mx.set.seed(1234)

# Create a MXNet feed-forward neural net model with the specified training.
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
    ctx=mx.cpu(), num.round=1000, array.batch.size=20,
    learning.rate=2e-6, momentum=0.9, eval.metric=mx.metric.rmse, verbose=F)


Fig. 23.15 The strong linear relation between the out-of-bag testing data continuous outcome variable (y-axis) and the corresponding predicted regression values (x-axis) suggests a good network prediction performance
Fig. 23.16 Computational graph of the neural network

The option verbose = F suppresses messages, including the training accuracy reports, in each iteration. You may rerun the code with verbose = T to examine the rate of decrease of the training error against the number of iterations.

You must scale the data before inputting it into MXNet, which expects that the training and testing sets are normalized to the same scale. There are two strategies to scale the data:

• Either scale the complete data simultaneously and then split it into training and testing data, or

• Alternatively, scale only the training dataset to enable model training, but save your data normalization protocol, as new data (testing, validation) will need to be (pre)processed the same way as the training data.
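The second strategy can be sketched in base R as follows. The data here are simulated purely for illustration; the key point is that the centers and scales estimated on the training rows are stored and reused for the test rows.

```r
set.seed(1234)
# Simulated data (an assumption for this sketch): 100 cases, 2 features
X <- matrix(rnorm(200, mean = 5, sd = 3), ncol = 2)
train.ind <- sample(seq_len(nrow(X)), size = round(0.7 * nrow(X)))

# Scale the training data and save the normalization protocol
train.scaled <- scale(X[train.ind, ])
ctr <- attr(train.scaled, "scaled:center")   # training column means
scl <- attr(train.scaled, "scaled:scale")    # training column standard deviations

# Apply the *training* centers and scales to the test data
test.scaled <- scale(X[-train.ind, ], center = ctr, scale = scl)

colMeans(train.scaled)   # numerically zero by construction
colMeans(test.scaled)    # close to, but generally not exactly, zero
```

Reusing ctr and scl keeps the training and testing features on the same scale, whereas rescaling the test set on its own would use different means and standard deviations.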

Have a look at the Google TensorFlow API. It shows the importance of the learning rate and the number of rounds. You should test different sets of parameters.

• A learning rate that is too small may lead to long computations.

• A learning rate that is too large may cause the algorithm to fail to converge, as a large step size (learning rate) may bypass the optimal solution and then oscillate or even diverge.
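The effect of the step size can be seen on a toy problem. The sketch below is not MXNet code; it runs plain gradient descent on f(x) = x^2 (gradient 2x, minimum at x = 0), and the function name gd is hypothetical.

```r
# Gradient descent on f(x) = x^2, starting from x0
gd <- function(learning.rate, x0 = 1, rounds = 50) {
  x <- x0
  for (i in seq_len(rounds)) {
    x <- x - learning.rate * 2 * x   # step against the gradient 2 * x
  }
  x
}

gd(0.001)  # too small: after 50 rounds x has barely moved toward 0
gd(0.4)    # moderate: x is extremely close to 0
gd(1.1)    # too large: each step overshoots 0, so the iterates flip sign and grow
```

Each update multiplies x by (1 - 2 * learning.rate), so the iterates shrink only when that factor has absolute value below 1.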

preds = predict(model, test.x)
sqrt(mean((preds-test.y)^2))

## [1] 0.2171032

range(test.y)

## [1] -3.181499 1.943890

# plot the correlation between testdata.y and testdata.predicted.y
plot(preds, test.y)

The RMSE on the test set (about 0.22) is small relative to the range of the outcome (about 5.1). To get a visual representation of the deep learning network, we can also display this relatively simple computation graph (Fig. 23.16):

graph.viz(model$symbol)
23.5.2 Spirals 2D Data

We can again use the mx.mlp wrapper to construct the learning network, but we can also use a more flexible way to construct and configure the multi-layer network in mxnet. This configuration is done by using the Symbol call, which specifies the links among network nodes, the activation function, dropout ratio, and so on.

Below we show the configuration of a perceptron with one hidden layer.

########### Network configuration
# variables
act <- mx.symbol.Variable("data")

# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)

# non-linear activation
act <- mx.symbol.Activation(data = fc, act_type = "relu")

# affine transformation
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)

# softmax output and cross-entropy loss
mlp <- mx.symbol.SoftmaxOutput(fc)

#### Preparing data
library(mlbench)  # provides mlbench.spirals()
set.seed(2235)

############ spirals dataset
s <- sample(x = c("train", "test"), size = 1000, prob = c(.8, .2), replace = TRUE)

dta <- mlbench.spirals(n = 1000, cycles = 1.2, sd = .03)
dta <- cbind(dta[["x"]], as.integer(dta[["classes"]]) - 1)
colnames(dta) <- c("x", "y", "label")

######### train, validate, test
dta.train <- dta[s == "train", ]
dta.test <- dta[s == "test", ]

Let's display the data and examine its structure (Fig. 23.17).

dt <- as.data.frame(dta); dt[, 3] <- as.factor(dt[, 3])
dt.train <- dt[s == "train", ]
dt.test <- dt[s == "test", ]

p1 <- ggplot(dt, aes(x = x, y = y, color = label)) + geom_point() + ggtitle("Whole data structure")
p2 <- ggplot(dt.train, aes(x = x, y = y, color = label)) + geom_point() + ggtitle("Train data structure")
p3 <- ggplot(dt.test, aes(x = x, y = y, color = label)) + geom_point() + ggtitle("Test data structure")
grid.arrange(p1, p2, p3, nrow = 3)

Fig. 23.17 Original spirals data structure (whole, training, and testing sets)


# Network training

# Feed-forward networks may be trained using iterative gradient descent algorithms.
# A batch is a subset of data that is used during a single forward pass of the algorithm.
# An epoch represents one step of the iterative process that is repeated until all
# training examples are used.
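The relation between batches and epochs can be illustrated without mxnet. In this base-R sketch the sample size and batch size are arbitrary choices for illustration; one epoch consists of one pass over all of the batches.

```r
n <- 95            # hypothetical number of training examples
batch.size <- 20   # plays the role of array.batch.size

# Assign consecutive example indices to batches of (at most) batch.size
batches <- split(seq_len(n), ceiling(seq_len(n) / batch.size))

length(batches)           # number of batches processed in one epoch
sapply(batches, length)   # batch sizes; the last batch holds the remainder
```

With these values one epoch processes five batches, so training for num.round epochs performs five times num.round parameter updates in this toy setting.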

############ basic spiral-data training
mx.set.seed(2235)

model <- mx.model.FeedForward.create(
  symbol = mlp,
  X = dta.train[, c("x", "y")],
  y = dta.train[, c("label")],
  num.round = 500,
  array.layout = "rowmajor",
  learning.rate = 1,
  eval.metric = mx.metric.accuracy, verbose = F)
preds = predict(model, dta.test[, c(1:2)])

pred.label = max.col(t(preds))-1; table(pred.label, dta.test[, 3])

## pred.label  0  1
##          0 90 30
##          1 22 73
Fig. 23.18 Frequency of feed-forward neural network prediction probabilities (x-axis) for the spirals data relative to testing set labels (colors)

The prediction is only moderately accurate (163 of the 215 test cases, about 76%, are classified correctly), and we can inspect the results in more depth using crossval::confusionMatrix (Fig. 23.18).


library("crossval")

diagnosticErrors(crossval::confusionMatrix(pred.label, dta.test[, 3], negative = 0))

##       acc      sens      spec       ppv       npv       lor
## 0.7581395 0.7684211 0.7500000 0.7087379 0.8035714 2.2980293
## attr(,"negative")
## [1] 0

ggplot(data.frame(dta.test[, 3], preds[2, ]),
       aes(x=preds[2, ], group=dta.test[, 3], fill=as.factor(dta.test[, 3]))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
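The diagnostic measures reported above are simple ratios of the four cells of a 2 x 2 confusion table. The following base-R sketch uses a hypothetical helper, diag.metrics, and reads the counts from the spirals table with predicted labels in rows and true labels in columns, treating class 1 as the positive class.

```r
# Hypothetical helper: standard diagnostic ratios from confusion-table counts
diag.metrics <- function(tn, fn, fp, tp) {
  c(acc  = (tp + tn) / (tp + tn + fp + fn),  # overall accuracy
    sens = tp / (tp + fn),                   # true positives among actual positives
    spec = tn / (tn + fp),                   # true negatives among actual negatives
    ppv  = tp / (tp + fp),                   # true positives among predicted positives
    npv  = tn / (tn + fn))                   # true negatives among predicted negatives
}

m <- diag.metrics(tn = 90, fn = 30, fp = 22, tp = 73)
round(m, 4)
```

If the roles of the predicted and the true labels are exchanged, sens and ppv (and likewise spec and npv) swap values, so it is worth checking which argument a given function treats as the truth.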


Once  we  fit  a  model  (like  the  binary  label  classification  below),  we  can: 

•  Visually  inspect  the  quality  of  the  ML  classification. 

•  Display  the  structure  of  the  labeled  test-data  objects  (Fig.  23.19). 
Fig. 23.19 Testing-data validation of neural network model (spirals)


# Define a custom callback, which stops the training process when the progress in
# accuracy falls below a certain level of tolerance. The call is made after every
# epoch to check the status of convergence of the algorithm.

mx.callback.train.stop <- function(tol = 1e-3, mean.n = 1e2, period = 100, min.iter = 100) {
  function(iteration, nbatch, env, verbose = TRUE) {
    if (nbatch == 0 & !is.null(env$metric)) {
      continue <- TRUE
      acc.train <- env$metric$get(env$train.metric)$value
      if (is.null(env$acc.log)) {
        env$acc.log <- acc.train
      } else {
        if ((abs(acc.train - mean(tail(env$acc.log, mean.n))) < tol &
             abs(acc.train - max(env$acc.log)) < tol &
             iteration > min.iter) |
            acc.train == 1) {
          cat("Training finished with final accuracy: ",
              round(acc.train * 100, 2), " %\n", sep = "")
          continue <- FALSE
        }
        env$acc.log <- c(env$acc.log, acc.train)
      }
    }
    if (iteration %% period == 0) {
      cat("[", iteration, "]", " training accuracy: ",
          round(acc.train * 100, 2), " %\n", sep = "")
    }
    return(continue)
  }
}
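The stopping rule inside this callback can be examined in isolation. The sketch below is a base-R re-expression of the same three-part condition applied to a simulated accuracy trace; the function stop.epoch and the trace itself are assumptions made for illustration and are not part of mxnet.

```r
# Hypothetical helper: first epoch at which the stopping rule would fire
stop.epoch <- function(acc, tol = 1e-3, mean.n = 100, min.iter = 100) {
  for (i in seq_along(acc)[-1]) {
    acc.log <- acc[seq_len(i - 1)]   # accuracies logged before epoch i
    plateau <- abs(acc[i] - mean(tail(acc.log, mean.n))) < tol &&  # near recent mean
               abs(acc[i] - max(acc.log)) < tol &&                 # near best so far
               i > min.iter                                        # enough epochs run
    if (plateau || acc[i] == 1) return(i)
  }
  NA_integer_   # the rule never fired
}

# Simulated trace: accuracy rises for 150 epochs, then stays flat at 0.76
acc <- c(seq(0.5, 0.76, length.out = 150), rep(0.76, 300))
stopped <- stop.epoch(acc)
stopped           # fires some time after the plateau begins at epoch 151
stop.epoch(c(0.5, 1))   # perfect accuracy stops training immediately
```

Because the rule compares the current accuracy with the mean of the last mean.n logged values, it fires only after the plateau has lasted long enough to dominate that window.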
###### training with custom stopping
mx.set.seed(2235)

model <- mx.model.FeedForward.create(
  symbol = mlp,
  X = dta.train[, c("x", "y")],
  y = dta.train[, c("label")],
  num.round = 2000,
  array.layout = "rowmajor",
  learning.rate = 1,
  epoch.end.callback = mx.callback.train.stop(),
  eval.metric = mx.metric.accuracy,
  verbose = FALSE
)

## [100] training accuracy: 75.56 %
## [200] training accuracy: 76 %
## [300] training accuracy: 76 %
## [400] training accuracy: 76.45 %
## Training finished with final accuracy: 76.45 %

labeled_spiral_data <- as.data.frame(cbind(dta.test[, c("x", "y")], as.factor(pred.label)))
colnames(labeled_spiral_data) <- c("x", "y", "label")
labeled_spiral_data$label <- as.factor(labeled_spiral_data$label)
p4 <- ggplot(labeled_spiral_data, aes(x = x, y = y, color = label)) +
  geom_point() + ggtitle("Structure of Predicted-Labels on Test Data")
p4


23.5.3  IBS  Study 

Let’s  try  another  example  using  the  IBS  Neuroimaging  study  (Figs.  23.20  and 
23.21). 

# IBS NI Data
library(rvest)  # provides read_html(), html_nodes(), html_table()

# UCLA Data
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain")
IBSData <- html_table(html_nodes(wiki_url, "table")[[2]])  # table 2
set.seed(1234)
test.ind = sample(1:354, 50, replace = F)  # select 50/354 of cases for testing, train on remaining (354-50)/354 cases

# UMich Data (includes MISSING data): use 'mice' to impute missing data with
# predictive mean matching: newData <- mice(data, m=5, maxit=50, meth='pmm', seed=500); summary(newData)
# wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_April2011_NI_IBS_Pain")
# IBSData <- html_table(html_nodes(wiki_url, "table")[[1]])  # load Table 1
# set.seed(1234)
# test.ind = sample(1:337, 50, replace = F)  # select 50/337 of cases for testing, train on remaining (337-50)/337 cases
# summary(IBSData); IBSData[IBSData=="."] <- NA; newData <- mice(IBSData, m=5, maxit=50, meth='pmm', seed=500); summary(newData)

html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content">\n\t\t<a name="top" id="top"></a>\n\t\t\t\t<h1 id= ...

# View(IBSData); dim(IBSData): Select an outcome response "DX"(3), "FS_IQ"(5)

# scale/normalize all input variables
IBSData <- na.omit(IBSData)
IBSData[, 4:66] <- scale(IBSData[, 4:66])  # scale the entire dataset
train.x = data.matrix(IBSData[-test.ind, c(4:66)])  # exclude outcome
train.y = IBSData[-test.ind, 3]-1
test.x = data.matrix(IBSData[test.ind, c(4:66)])
test.y = IBSData[test.ind, 3]-1

# View(data.frame(train.x, train.y))
# View(data.frame(test.x, test.y))
# table(test.y); table(train.y)

# num.round - number of iterations to train the model
act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)

mx.set.seed(2235)

model <- mx.model.FeedForward.create(
  symbol = mlp,
  array.batch.size = 20,
  X = train.x, y = train.y,
  num.round = 200,
  array.layout = "rowmajor",
  learning.rate = exp(-1),
  eval.metric = mx.metric.accuracy, verbose = FALSE)
preds = predict(model, test.x)

pred.label = max.col(t(preds))-1; table(pred.label, test.y)

##           test.y
## pred.label  0  1
##          0 23 10
##          1 10  7

library("crossval")

diagnosticErrors(crossval::confusionMatrix(pred.label, test.y, negative = 0))

##       acc      sens      spec       ppv       npv       lor
## 0.6000000 0.4117647 0.6969697 0.4117647 0.6969697 0.4762342
## attr(,"negative")
## [1] 0

ggplot(data.frame(test.y, preds[2, ]),
       aes(x=preds[2, ], group=test.y, fill=as.factor(test.y))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()
Fig. 23.20 Frequency of the feed-forward neural network prediction probabilities (x-axis) for the IBS data relative to testing set labels (colors)
Fig. 23.21 Validation results of the binarized feed-forward neural network prediction probabilities (y-axis) for the IBS testing data (x-axis) with label-coding for match(0)/mismatch(1)
# convert pred-probability to binary classes threshold=0.3?
bin_preds <- ifelse(preds[2, ]<0.3, 0, 1)

# get a factor variable comparing binary test-labels vs. predicted-labels
label_match <- as.factor(ifelse(test.y==bin_preds, 0, 1))

p5 <- ggplot(data.frame(test.y, preds[2, ]), aes(x = test.y, y = preds[2, ],
  color=label_match)) + geom_point() + ggtitle("Match between Test Data Labels and Predicted Labels")
p5


The histogram plot (Fig. 23.20), together with the 60% test-set accuracy, suggests that the classification is not good.
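The choice of the probability threshold used to binarize predictions can change the resulting labels. A minimal base-R sketch, using made-up probabilities and labels (not the IBS data), shows how accuracy may depend on the threshold.

```r
# Hypothetical predicted probabilities of class 1 and the corresponding true labels
prob1 <- c(0.10, 0.35, 0.45, 0.80, 0.25, 0.60)
truth <- c(0, 1, 1, 1, 0, 1)

# Label an example as class 1 when its probability reaches the threshold
binarize <- function(p, threshold) ifelse(p < threshold, 0, 1)
acc <- function(threshold) mean(binarize(prob1, threshold) == truth)

acc(0.3)   # lower threshold: more examples are labeled as class 1
acc(0.5)   # the usual default threshold
```

In this toy example the lower threshold recovers two positive cases that the 0.5 threshold misses; on real data the threshold would normally be chosen on a validation set rather than on the test data.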


23.5.4  Country  QoL  Ranking  Data 

Another  case  study  we  have  seen  before  is  the  country  quality  of  life  (QoL)  dataset. 
Let’s  explore  a  new  neural  network  model  and  use  it  to  predict  the  overall  country  QoL. 

wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content">\n\t\t<a name="top" id="top"></a>\n\t\t\t\t<h1 id= ...

CountryRankingData <- html_table(html_nodes(wiki_url, "table")[[2]])

# View(CountryRankingData); dim(CountryRankingData): Select an appropriate
# outcome "OA": Overall country ranking (13)
# Dichotomize outcome. Top countries OA<50, bottom countries OA>=50
set.seed(1234)
test.ind = sample(1:100, 30, replace = F)  # select 30/100 of cases for testing, train on remaining 70/100 cases

CountryRankingData[, c(8:12, 14)] <- scale(CountryRankingData[, c(8:12, 14)])  # scale/normalize all input variables

train.x = data.matrix(CountryRankingData[-test.ind, c(8:12, 14)])  # exclude outcome
train.y = ifelse(CountryRankingData[-test.ind, 13] < 50, 1, 0)
test.x = data.matrix(CountryRankingData[test.ind, c(8:12, 14)])
test.y = ifelse(CountryRankingData[test.ind, 13] < 50, 1, 0)  # developed (high OA rank) country

# View(data.frame(train.x, train.y)); View(data.frame(test.x, test.y))
# View(data.frame(CountryRankingData, ifelse(CountryRankingData[, 13] < 50, 1, 0)))

act <- mx.symbol.Variable("data")
fc <- mx.symbol.FullyConnected(act, num.hidden = 10)
act <- mx.symbol.Activation(data = fc, act_type = "sigmoid")
fc <- mx.symbol.FullyConnected(act, num.hidden = 2)
mlp <- mx.symbol.SoftmaxOutput(fc)

mx.set.seed(2235)

model <- mx.model.FeedForward.create(
  symbol = mlp,
  array.batch.size = 10,
  X = train.x, y = train.y,
  num.round = 15,
  array.layout = "rowmajor",
  learning.rate = exp(-1),
  eval.metric = mx.metric.accuracy)

## Start training with 1 devices
## [1] Train-accuracy=0.416666666666667
## [2] Train-accuracy=0.442857142857143
## [3] Train-accuracy=0.442857142857143
## [4] Train-accuracy=0.442857142857143
## [5] Train-accuracy=0.442857142857143
## [6] Train-accuracy=0.442857142857143
## [7] Train-accuracy=0.6
## [8] Train-accuracy=0.8
## [9] Train-accuracy=0.914285714285714
## [10] Train-accuracy=0.928571428571429
## [11] Train-accuracy=0.942857142857143
## [12] Train-accuracy=0.942857142857143
## [13] Train-accuracy=0.942857142857143
## [14] Train-accuracy=0.971428571428572
## [15] Train-accuracy=0.971428571428572

preds = predict(model, test.x); preds

##           [,1]      [,2]        [,3]        [,4]      [,5]      [,6]
## [1,] 0.5204602 0.8808465 0.007948651 0.009155557 0.8622462 0.8432776
## [2,] 0.4795398 0.1191535 0.992051363 0.990844429 0.1377538 0.1567224
##           [,7]      [,8]       [,9]     [,10]      [,11]     [,12]
## [1,] 0.4493238 0.6563529 0.97970927 0.7055513 0.98414272 0.9647682
## [2,] 0.5506761 0.3436471 0.02029071 0.2944487 0.01585729 0.0352319
##          [,13]      [,14]     [,15]     [,16]     [,17]      [,18]
## [1,] 0.6106228 0.91565907 0.8317797 0.0252018 0.7618818 0.01770884
## [2,] 0.3893772 0.08434091 0.1682204 0.9747981 0.2381181 0.98229110
##            [,19]     [,20]      [,21]       [,22]      [,23]       [,24]
## [1,] 0.007323461 0.7766624 0.94527471 0.007209368 0.09066615 0.007661197
## [2,] 0.992676497 0.2233376 0.05472526 0.992790580 0.90933383 0.992338777
##          [,25]       [,26]      [,27]     [,28]      [,29]      [,30]
## [1,] 0.0489373 0.009559323 0.91361207 0.1901348 0.90563852 0.97519016
## [2,] 0.9510627 0.990440726 0.08638796 0.8098652 0.09436146 0.02480989

pred.label = max.col(t(preds))-1; table(pred.label, test.y)

##           test.y
## pred.label  0  1
##          0 17  1
##          1  1 11


We only need 15 rounds to achieve 97% training accuracy; on the testing data, 28 of the 30 cases (about 93%) are classified correctly (Figs. 23.22 and 23.23).


ggplot(data.frame(test.y, preds[2, ]),
       aes(x=preds[2, ], group=test.y, fill=as.factor(test.y))) +
  geom_histogram(position="dodge", binwidth=0.25) + theme_bw()

Fig. 23.22 Frequency of the feed-forward neural network prediction probabilities (x-axis) for the QoL data relative to testing set labels (colors)
Fig. 23.23 Validation results of the binarized feed-forward neural network prediction probabilities (y-axis) for the QoL testing data with label-coding for match(0)/mismatch(1)

# calculate sensitivity & specificity and more
library("crossval")

diagnosticErrors(crossval::confusionMatrix(pred.label, test.y, negative = 0))

##       acc      sens      spec       ppv       npv       lor
## 0.9333333 0.9166667 0.9444444 0.9166667 0.9444444 5.2311086
## attr(,"negative")
## [1] 0

# convert pred-probability to binary classes threshold=0.5?
bin_preds <- ifelse(preds[2, ]<0.5, 0, 1)

# get a factor variable comparing binary test-labels vs. predicted-labels
label_match <- as.factor(ifelse(test.y==bin_preds, 0, 1))

p6 <- ggplot(data.frame(test.y, preds[2, ]), aes(x = test.y, y = preds[2, ],
  color=label_match)) + geom_point() + ggtitle("Match between Test Data Labels and Predicted Labels")
p6


23.5.5  Handwritten  Digits  Classification 

In Chap. 11 (ML, NN, SVM Classification) we discussed Optical Character Recognition (OCR). Specifically, we analyzed handwritten notes (unstructured text) and converted them to printed text.

MNIST is a large collection of human-annotated (labeled) images of handwritten digits. Every digit is represented by a 28 x 28 thumbnail image. You can download the training and testing data from Kaggle.

The train.csv and test.csv data files contain gray-scale images of hand-drawn digits, 0, 1, 2, ..., 9. Each 2D image is 28 x 28 in size and each of the 784 pixels has a single pixel-intensity representing the lightness or darkness of that pixel (stored as a 1-byte integer in [0, 255]). Higher intensities correspond to darker pixels.

The training data, train.csv, has 785 columns, where the first column, label, codes the actual digit drawn by the user. The remaining 784 columns contain the 28 x 28 = 784 pixel-intensities of the associated 2D image. Columns in the training set have pixelK names, where 0 ≤ K ≤ 783. To reconstruct a 2D image out of each row in the training data we use this relation between pixel-index (K) and X, Y image coordinates:

K = Y x 28 + X,

where 0 ≤ X, Y ≤ 27. Thus, pixelK is located on row Y and column X of the corresponding 2D image of size 28 x 28. For instance, pixel60 (60 = 2 x 28 + 4, so X = 4, Y = 2) represents the pixel on the third row and fifth column in the image. Diagrammatically, omitting the “pixel” prefix, the pixels may be ordered to reconstruct the 2D image as follows (Table 23.3).

Note that the point-to-pixelID transformation (K = Y x 28 + X) may easily be inverted as a pixelID-to-point mapping: X = K mod 28 (the remainder of the integer division K/28) and Y = K %/% 28 (the integer part of the division K/28). For example:
Table 23.3 Schematic for reconstructing a 28 x 28 square image using a list of 784 intensities corresponding to colors in the image reflecting the manual handwritten digits

Row     Col0  Col1  Col2  Col3  Col4  ...  Col26  Col27
Row0       0     1     2     3     4  ...     26     27
Row1      28    29    30    31    32  ...     54     55
Row2      56    57    58    59    60  ...     82     83
RowK     ...   ...   ...   ...   ...  ...    ...    ...
Row26    728   729   730   731   732  ...    754    755
Row27    756   757   758   759   760  ...    782    783

K  <-  60 

X <- K %% 28   # X = K mod 28, remainder of integer division 60/28
Y <- K %/% 28  # integer part of the division

# This validates that the application of both the back and forth
# transformations leads to an identity
K; X; Y; Y * 28 + X

##  [1]  60 
##  [1]  4 
##  [1]  2 
##  [1]  60 
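The same check can be run for all 784 pixel indices at once. This base-R sketch also lays the indices out as in Table 23.3; the object name pixel.id is introduced here only for illustration.

```r
K <- 0:783
X <- K %% 28     # column index of each pixel (0..27)
Y <- K %/% 28    # row index of each pixel (0..27)

# The forward map K = Y * 28 + X recovers every index
all(Y * 28 + X == K)

# Lay out the pixel IDs row by row, as in the 28 x 28 image
pixel.id <- matrix(K, nrow = 28, ncol = 28, byrow = TRUE)
pixel.id[3, 5]   # third row, fifth column (R indices are 1-based): pixel 60
```

Note that byrow = TRUE is needed because R fills matrices column by column by default, while the pixel IDs run along the image rows.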

The test data (test.csv) has the same organization as the training data, except that it does not contain the first label column. It includes 28,000 images and we can predict image labels that can be stored as ImageId, Label pairs, which can be visually compared to the 2D images for validation/inspection.

require(mxnet)

# train.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip", pathToZip)
train <- read.csv(unzip(pathToZip))
dim(train)

## [1] 42000 785

unlink(pathToZip)

# test.csv
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TestingData.zip", pathToZip)
test <- read.csv(unzip(pathToZip))
dim(test)

## [1] 28000 784

unlink(pathToZip)

train <- data.matrix(train)
test <- data.matrix(test)

train.x <- train[, -1]
train.y <- train[, 1]

# Scaling will be discussed below
train.x <- t(train.x/255)
test <- t(test/255)

Let's look at some of these example images (Figs. 23.24, 23.25, 23.26 and 23.27):

library("imager")

# first convert the CSV data (one row per image, 28,000 rows)
array_3D <- array(test, c(28, 28, 28000))

mat_2D <- matrix(array_3D[, , 1], nrow = 28, ncol = 28)
plot(as.cimg(mat_2D))


Fig. 23.24 Image rendering of the first handwritten digit, stored as a 28 x 28 array of intensities
Fig. 23.25 Rendering of the fifth handwritten digit in the list of 28,000
Fig. 23.26 Another strategy for indexing and plotting handwritten digits as 2D images
Fig. 23.27 Sequential image plot of the first four handwritten digits


# extract all N=28,000 images
N <- 28000
img_3D <- as.cimg(array_3D[, , ], 28, 28, N)

# plot the k-th image (1<=k<=N)
k <- 5
plot(img_3D, k)

image_2D <- function(img, index) {
  img[, , index, , drop=FALSE]
}

plot(image_2D(img_3D, 1))

# Plot a collage of the first 4 images
imappend(list(image_2D(img_3D, 1), image_2D(img_3D, 2), image_2D(img_3D, 3),
  image_2D(img_3D, 4)), "y") %>% plot

# img <- image_2D(img_3D, 1)
# for (i in 10:20) { imappend(list(img, image_2D(img_3D, i)), "x") }

In these CSV data files, each 28 x 28 image is represented as a single row. The intensities of these greyscale images are stored as 1-byte integers, in the range [0, 255], which we linearly transformed into [0, 1]. Note that we only scale the X input, not the output (labels). Also, we don't have manual gold-standard validation labels for the testing data, i.e., test.y is not available for the handwritten digits data.
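The rescale-and-transpose step can be sketched on a toy matrix. The three 2 x 2 "images" below are made up for illustration; only the operations (division by 255 and t()) mirror what was applied to the digit data.

```r
# Three hypothetical 2 x 2 "images", one per row, with intensities in [0, 255]
imgs <- matrix(c(  0, 128, 255,  64,
                 255, 255,   0,   0,
                  10,  20,  30,  40), nrow = 3, byrow = TRUE)

# Rescale intensities linearly into [0, 1], then put one example in each column
x <- t(imgs / 255)

dim(x)     # 4 (pixels) x 3 (examples)
range(x)   # all values lie within [0, 1]
```

After the transpose, x[, j] holds the pixels of the j-th image, which matches the pixels-by-examples layout used for the digit inputs above.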


# We already scaled earlier
# train.x <- t(train.x/255)
# test <- t(test/255)
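The rescaling and transposition steps can be illustrated on a tiny made-up matrix. This is only a sketch in base R; the object toy_x (two hypothetical "images" with four pixels each) is an assumption introduced for illustration and is not part of the handwritten digits data.

```r
# Toy stand-in for the CSV pixel data: 2 images (rows) x 4 pixels (columns).
# 'toy_x' is a hypothetical name used only for this illustration.
toy_x <- matrix(c(  0, 128, 255, 64,
                  255,   0,  32, 16), nrow = 2, byrow = TRUE)

# Linearly rescale the 1-byte intensities from [0, 255] to [0, 1]
# and transpose to the pixels x examples layout.
toy_scaled <- t(toy_x / 255)

dim(toy_scaled)     # 4 pixels x 2 examples
range(toy_scaled)   # all values now lie in [0, 1]
```

Only the inputs are rescaled; the labels (digits 0-9) are left untouched.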

Next, we can transpose the input matrix to n (pixels) x m (examples), the column-major
format required by mxnet. The image labels are approximately evenly distributed:


table(train.y); prop.table(table(train.y))

## train.y
##    0    1    2    3    4    5    6    7    8    9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
## train.y
##          0          1          2          3          4          5
## 0.09838095 0.11152381 0.09945238 0.10359524 0.09695238 0.09035714
##          6          7          8          9
## 0.09850000 0.10478571 0.09673810 0.09971429

The  majority  class  (1)  in  the  training  set  includes  11.2%  of  the  observations. 
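
The 11.2% figure can be reproduced directly from the reported label counts. The sketch below, in base R, re-enters the counts from the output above by hand (the object names train_counts, majority_digit and majority_pct are illustrative assumptions) and is useful as a reminder that a trivial classifier always predicting the majority digit would be right only about 11% of the time.

```r
# Label counts copied from the table(train.y) output above.
train_counts <- c("0" = 4132, "1" = 4684, "2" = 4177, "3" = 4351, "4" = 4072,
                  "5" = 3795, "6" = 4137, "7" = 4401, "8" = 4063, "9" = 4188)

train_props <- train_counts / sum(train_counts)

majority_digit <- names(which.max(train_props))    # most frequent label
majority_pct <- round(100 * max(train_props), 1)   # its share, in percent

cat("Majority class:", majority_digit, "with", majority_pct, "% of the images\n")
```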


Configuring  the  Neural  Network 


data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

data <- mx.symbol.Variable("data") represents the input layer. The first hidden
layer, set by fc1 <- mx.symbol.FullyConnected(data, name="fc1",
num_hidden=128), takes the data as input, together with the layer name and the
number of hidden neurons (128), and generates the layer output.

act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu") sets the
activation function, which takes the output from the first hidden layer "fc1" and
generates an output that is fed into the second hidden layer "fc2", which uses fewer
hidden neurons (64).

The process repeats with the second activation "act2", resembling "act1" but
using a different input source and name. As there are only ten digits (0, 1, ..., 9), in the
last layer "fc3", we set the number of neurons to 10. At the end, we set the activation
to softmax to obtain a probabilistic prediction.
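
To make the layer-by-layer description more concrete, here is a minimal base R sketch of a single forward pass through a network with the same layer sizes (784 inputs, 128 and 64 hidden neurons, 10 outputs). It does not use mxnet; the weights are random placeholders, and the helper names relu and softmax_vec are assumptions introduced only for this illustration.

```r
set.seed(1)

relu <- function(z) pmax(z, 0)
softmax_vec <- function(z) {
  e <- exp(z - max(z))   # subtract the max for numerical stability
  e / sum(e)
}

# Layer sizes: 784 input pixels -> 128 -> 64 -> 10 output classes.
sizes <- c(784, 128, 64, 10)

# Hypothetical random parameters (a trained network would learn these).
W <- lapply(1:3, function(l)
  matrix(runif(sizes[l + 1] * sizes[l], -0.07, 0.07), nrow = sizes[l + 1]))
b <- lapply(1:3, function(l) rep(0, sizes[l + 1]))

x  <- runif(784)                             # one fake image, scaled to [0, 1]
h1 <- relu(W[[1]] %*% x + b[[1]])            # fc1 + relu1
h2 <- relu(W[[2]] %*% h1 + b[[2]])           # fc2 + relu2
p  <- softmax_vec(W[[3]] %*% h2 + b[[3]])    # fc3 + softmax

sum(p)   # the 10 class probabilities add up to 1

# Number of trainable parameters: weights plus biases in each layer.
n_params <- sum(sizes[-1] * sizes[-4] + sizes[-1])
n_params
```

Even this small fully connected network has over one hundred thousand trainable parameters, most of them in the first layer.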


Training 

We  are  almost  ready  for  the  training  process.  Before  we  start  the  computation,  let’s 
decide  what  device  we  should  use. 

devices  <-  mx.cpu() 

Here we assign the CPU to mxnet. After all this preparation, we can run the
following command to train the neural network. Note that in mxnet, the correct
function to control the random process is mx.set.seed.

mx.set.seed(1234)
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
    ctx=devices, num.round=10, array.batch.size=100,
    learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
    initializer=mx.init.uniform(0.07),
    epoch.end.callback=mx.callback.log.train.metric(100)
)

## Start training with 1 devices
## [1] Train-accuracy=0.863031026252982
## [2] Train-accuracy=0.958285714285716
## [3] Train-accuracy=0.970785714285717
## [4] Train-accuracy=0.977857142857146
## [5] Train-accuracy=0.983238095238099
## [6] Train-accuracy=0.98521428571429
## [7] Train-accuracy=0.987095238095242
## [8] Train-accuracy=0.989309523809528
## [9] Train-accuracy=0.99214285714286
## [10] Train-accuracy=0.991452380952384

After 10 rounds, the training accuracy exceeds 99%. Note that this is the accuracy on
the training data itself, not on independent testing data. It may not be worthwhile trying
100 rounds, as this would substantially increase the computing time.
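
The learning.rate and momentum arguments used in the training call control the gradient-descent updates. As a rough illustration, the base R sketch below applies the textbook (classical) momentum update to a one-dimensional quadratic function. This is an assumption about the general form of the update, not a transcription of the mxnet optimizer internals, and the objective function and starting point are made up.

```r
# Minimize f(w) = w^2 / 2, whose gradient is simply w (minimum at w = 0).
grad_f <- function(w) w

learning_rate <- 0.07   # same values as in the training call above
momentum <- 0.9

w <- 5   # arbitrary starting point
v <- 0   # velocity (decayed running sum of past steps)

for (step in 1:300) {
  v <- momentum * v - learning_rate * grad_f(w)   # accumulate velocity
  w <- w + v                                      # move along the velocity
}

w   # very close to the minimizer 0
```

With momentum, each step reuses a fraction of the previous step, which tends to smooth the trajectory compared with plain gradient descent.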


Forecasting 

Now, we will demonstrate how to generate predictions for the testing data and how
to summarize them. The predict() call returns a matrix, preds, with 10 rows (classes)
and 28,000 columns (images), containing the classification probabilities from the
output layer of the neural net. After transposing it into 28,000 rows and 10 columns,
we can extract the most likely label for each row using max.col:

# evaluate: "preds" is the matrix of the probabilities of each of the 10 digits
preds <- predict(model, test)
pred.label <- max.col(t(preds)) - 1
table(pred.label)

## pred.label
##    0    1    2    3    4    5    6    7    8    9
## 2774 3228 2862 2728 2781 2401 2777 2868 2826 2755

# preds1 <- ifelse(preds[2, ] <= 0.5, 0, 1)  # dichotomize to labels
# pred.label = t(preds1)
# table(pred.label, test.y)

# calculate sensitivity & specificity
# sensitivity(factor(preds1), factor(as.numeric(test.y)), positive = 1)
# specificity(factor(preds1), factor(as.numeric(test.y)), negative = 0)

# preds <- predict(model, test.x)
# dim(preds)

For binary classification, mxnet outputs two prediction classes, whereas for
multi-class predictions, it outputs a matrix of size n (classes) x m (examples),
where each column holds the predicted class probabilities for one example, so
all column sums add up to 1.0.

After transposition, the predictions are stored in a 28,000 (rows) x 10 (columns) matrix,
including the desired classification probabilities from the output layer. The R max.col
function returns, for each row, the index of the column with the largest value; subtracting 1
maps these indices (1-10) to the digit labels (0-9).

pred.label <- max.col(t(preds)) - 1
table(pred.label)

## pred.label
##    0    1    2    3    4    5    6    7    8    9
## 2774 3228 2862 2728 2781 2401 2777 2868 2826 2755
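
The label-extraction step can be checked on a small example. The sketch below uses a made-up 3-class probability matrix (preds_toy and pred_toy are hypothetical names) laid out as classes x examples. Note that max.col breaks ties at random by default; ties.method = "first" is used here to make the toy result reproducible.

```r
# Toy probability matrix: 3 classes (rows) x 4 examples (columns).
preds_toy <- matrix(c(0.10, 0.70, 0.20,
                      0.80, 0.10, 0.10,
                      0.20, 0.20, 0.60,
                      0.05, 0.90, 0.05), nrow = 3)

colSums(preds_toy)   # each column sums to 1

# Transpose to examples x classes, take the arg-max column per row,
# and subtract 1 so that labels start at 0 (as the digits do).
pred_toy <- max.col(t(preds_toy), ties.method = "first") - 1
pred_toy
```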

We  can  save  the  predicted  labels  of  the  testing  handwritten  digits  to  CSV: 


predicted_lables <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(predicted_lables, file='predicted_lables.csv', row.names=FALSE,
    quote=FALSE)

We can open the predicted_lables.csv file and inspect the ML-labels
(saved in the 2-column ImageId and Label CSV format) assigned to the 28,000
manually drawn digits. As the testing handwritten digits data do not have
human-provided labels, we can't quantitatively assess the validity of the algorithm on the
testing data (Fig. 23.28). However, we can visually inspect random handwritten digit
instances (7 in the example below, image indices 4:10) against their predictions and
gain intuition of the accuracy rate of the ML classifier (Table 23.4, Fig. 23.29).




23  Deep  Learning,  Neural  Networks 


Fig. 23.28 Plot of the agreement between relative frequencies in the number of train.y labels
(in range 0-9) against the testing data predicted labels. These quantities are not directly related
(frequencies of digits in training.y and predicted.testing.data); we can't explicitly validate the
testing-data predictions, as we don't have gold-standard test.y labels! However, numbers closer
to the diagonal of the plot would indicate expected good classifications, whereas off-diagonal
points may suggest less effective labeling


Table 23.4 Predicted labels for the set of the first 7 handwritten digits

ImageId  Label
1        2
2        0
3        9
4        9
5        3
6        7
7        0

Fig. 23.29 Visual validation of the handwritten digits (left) and their neural network
prediction (right) for the set of seven images. The number and indices of these
testing data images can be manually specified

table(train.y)

## train.y
##    0    1    2    3    4    5    6    7    8    9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

table(predicted_lables[, 2])

##
##    0    1    2    3    4    5    6    7    8    9
## 2774 3228 2862 2728 2781 2401 2777 2868 2826 2755

# Plot the relative frequencies between the number of train.y labels
# (in range 0-9) against the testing data predicted labels.
# These are not directly related (training.y vs. predicted.testing.data)!
# Remember - we don't have gold-standard test.y labels! Generally speaking,
# numbers closer to the diagonal suggest expected good classifications,
# whereas off-diagonal points may suggest less effective labeling.

label.names <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
plot(ftable(train.y)[c(1:10)], ftable(predicted_lables[, 2])[c(1:10)])
text(ftable(train.y)[c(1:10)]+20, ftable(predicted_lables[, 2])[c(1:10)],
    labels=label.names, cex=1.2)

# For example, the ML-classification labels assigned to the first 7 images
# (from the 28,000 testing data collection) are:
head(predicted_lables, n = 7L)

##   ImageId Label
## 1       1     2
## 2       2     0
## 3       3     9
## 4       4     9
## 5       5     3
## 6       6     7
## 7       7     0

library(knitr)
kable(head(predicted_lables, n = 7L), format = "markdown")

# initialize a list of m=7 images from the N=28,000 available images
m_start <- 4
m_end <- 10
if (m_end <= m_start) { m_end = m_start+1 }  # check that m_end > m_start

label_Ypositons <- vector()  # initialize the array of label positions on the plot

for (i in m_start:m_end) {
  if (i == m_start) {
    img1 <- image_2D(img_3D, m_start)
  } else {
    img1 <- imappend(list(img1, image_2D(img_3D, i)), "y")
  }
  label.names[i+1-m_start] <- predicted_lables[i, 2]
  label_Ypositons[i+1-m_start] <- 15 + 28*(i-m_start)
}

plot(img1, axes=FALSE)
# use all (m_end-m_start+1) labels, one per stacked image
text(40, label_Ypositons, labels=label.names[1:(m_end-m_start+1)], cex=1.2,
    col="blue")
mtext(paste((m_end+1-m_start), " Random Images \n Indices (m_start=",
    m_start, " : m_end=", m_end, ")"), side=2, line=-6, col="black")
mtext("ML Classification Labels", side=4, line=-5, col="blue")




table(ftable(train.y)[c(1:10)], ftable(predicted_lables[, 2])[c(1:10)])

##
##        2401 2728 2755 2774 2777 2781 2826 2862 2868 3228
##   3795    1    0    0    0    0    0    0    0    0    0
##   4063    0    0    0    0    0    0    1    0    0    0
##   4072    0    0    0    0    0    1    0    0    0    0
##   4132    0    0    0    1    0    0    0    0    0    0
##   4137    0    0    0    0    1    0    0    0    0    0
##   4177    0    0    0    0    0    0    0    1    0    0
##   4188    0    0    1    0    0    0    0    0    0    0
##   4351    0    1    0    0    0    0    0    0    0    0
##   4401    0    0    0    0    0    0    0    0    1    0
##   4684    0    0    0    0    0    0    0    0    0    1
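
Because the training and testing sets have different sizes (42,000 vs. 28,000 images), relative frequencies are easier to compare than raw counts. The following base R sketch re-enters the counts reported above (the object names are illustrative assumptions) and reports the per-digit proportions side by side. Agreement of these marginal frequencies is a plausibility check only; it does not establish that individual testing images are labeled correctly.

```r
digits <- as.character(0:9)

# Counts copied from the table() outputs shown above.
train_counts <- setNames(c(4132, 4684, 4177, 4351, 4072,
                           3795, 4137, 4401, 4063, 4188), digits)
pred_counts  <- setNames(c(2774, 3228, 2862, 2728, 2781,
                           2401, 2777, 2868, 2826, 2755), digits)

# Convert to relative frequencies so the two sets are comparable.
train_freq <- train_counts / sum(train_counts)
pred_freq  <- pred_counts / sum(pred_counts)

# Side-by-side summary and the largest absolute discrepancy.
freq_table <- round(rbind(train = train_freq, predicted = pred_freq), 4)
print(freq_table)
max_gap <- max(abs(train_freq - pred_freq))
max_gap   # below one percentage point for these counts
```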

Examining  the  Network  Structure  Using  LeNet 

We can also use the mxnet package to build and train a LeNet convolutional neural
network (CNN) on the handwritten digits data.

Let’s  first  construct  the  network. 

# input
data <- mx.symbol.Variable('data')

# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
    kernel=c(2,2), stride=c(2,2))

# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
    kernel=c(2,2), stride=c(2,2))

# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")

# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)

# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
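
It can help to trace how the spatial size of a 28 x 28 image changes through the convolution and pooling layers defined above. The base R sketch below does this arithmetic by hand, assuming the convolutions use stride 1 and no padding (none is specified in the code above); out_size is a hypothetical helper name, not an mxnet function.

```r
# Output size of a convolution or pooling window along one spatial dimension,
# assuming no padding: floor((n - kernel) / stride) + 1.
out_size <- function(n, kernel, stride = 1) floor((n - kernel) / stride) + 1

n <- 28                                    # input images are 28 x 28
n <- out_size(n, kernel = 5)               # conv1: 5x5, stride 1 -> 24
n <- out_size(n, kernel = 2, stride = 2)   # pool1: 2x2, stride 2 -> 12
n <- out_size(n, kernel = 5)               # conv2: 5x5, stride 1 -> 8
n <- out_size(n, kernel = 2, stride = 2)   # pool2: 2x2, stride 2 -> 4

flat_length <- n * n * 50                  # 50 filters in conv2 -> inputs to fc1
flat_length
```

Under these assumptions, the flatten layer passes 800 values per image to the first fully connected layer.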

Next,  we  will  reshape  the  matrices  into  arrays. 

train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))
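
Assigning a new dim attribute does not move any data; it only reinterprets the same column-major values. A tiny base R sketch, using a made-up 4-pixel by 3-example matrix (toy and toy_array are hypothetical names), shows that each column of the matrix becomes one image slice of the 4D array.

```r
# Toy matrix: 4 pixels (rows) x 3 examples (columns).
toy <- matrix(1:12, nrow = 4)

toy_array <- toy
dim(toy_array) <- c(2, 2, 1, 3)   # width x height x channel x example

# The second example's pixels, as a 2 x 2 image, match column 2 of the matrix.
toy_array[, , 1, 2]
matrix(toy[, 2], nrow = 2)
```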

Compare the training speed on different devices - CPU vs. GPU. Start by
defining the devices.

n.gpu <- 1
device.cpu <- mx.cpu()
device.gpu <- lapply(0:(n.gpu-1), function(i) {
  mx.gpu(i)
})


Passing  a  list  of  devices  is  useful  for  high-end  computational  platforms  (e.g., 
multi-GPU  systems);  mxnet  can  train  on  multiple  GPUs  or  CPUs. 

To train using the CPU, try fewer iterations, as the protocol is computationally very
intense.

mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
    ctx=device.cpu, num.round=1, array.batch.size=100,
    learning.rate=0.05, momentum=0.9, wd=0.00001,
    eval.metric=mx.metric.accuracy,
    epoch.end.callback=mx.callback.log.train.metric(100))

## Start training with 1 devices
## [1] Train-accuracy=0.522267303102625

print(proc.time() - tic)

##    user  system elapsed
##  313.22   66.45   50.94

The  corresponding  training  on  GPU  is  similar,  but  it  requires  a  separate 
GPU-compilation  of  mxnet  (/mxnet/src/storage/storage.cc:78)  with  USE_CUDA=1 
to  enable  GPU  usage. 

mx.set.seed(1234)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
    ctx=device.gpu, num.round=5, array.batch.size=100,
    learning.rate=0.05, momentum=0.9, wd=0.00001,
    eval.metric=mx.metric.accuracy,
    epoch.end.callback=mx.callback.log.train.metric(100))
print(proc.time() - tic)

GPU training is typically faster than CPU training. Everyone can submit a new
classification result to Kaggle and see a ranking result for their classifier. Make sure
you follow the specific result-file submission format.

preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
submission <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)
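
Before uploading, it may be worth reading the CSV file back to confirm that it has the expected two-column layout. The sketch below uses three made-up labels and a temporary file (toy_submission and csv_path are hypothetical names), so it does not overwrite any real submission file.

```r
# Hypothetical predicted labels for three test images (illustration only).
toy_labels <- c(2, 0, 9)
toy_submission <- data.frame(ImageId = seq_along(toy_labels), Label = toy_labels)

# Write to a temporary file rather than the working directory.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_submission, file = csv_path, row.names = FALSE, quote = FALSE)

# Read the file back and confirm the header and the number of rows.
check <- read.csv(csv_path)
names(check)
nrow(check)
```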
23.6 Classifying Real-World Images

A real-world example of deep learning is classification of 2D images (pictures) or
3D volumes (e.g., neuroimages).

The image classification examples below show the use of a pre-trained
Inception-BatchNorm network to predict the class of a real-world image. The network
architecture is described in the 2015 Ioffe and Szegedy paper. The pre-trained
Inception-BatchNorm network is available online. This advanced model gives a
state-of-the-art prediction accuracy on imaging data. We also need the R imager package to
load and preprocess the 2D images.

# install.packages("imager")
require(mxnet)
require(imager)


23.6.1  Load  the  Pre-trained  Model 

Download  and  unzip  the  pre-trained  model  to  a  working  folder,  and  load  the  model 
and  the  mean  image  (used  for  preprocessing)  using  mx.nd.load  into  R.  This 
download  can  either  be  done  manually,  or  automated,  as  shown  below. 

pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/inception.zip", pathToZip)
model_file <- unzip(pathToZip)

# setwd(paste(getwd(), "results", sep='/'))
model = mx.model.load(paste(getwd(), "Inception_BN", sep='/'), iteration=39)

mean.img = as.array(
    mx.nd.load(paste(getwd(), "mean_224.nd", sep='/'))[["mean_img"]]
)
dim(mean.img)

## [1] 224 224 3

# plot(mean.img)


23.6.2  Load,  Preprocess  and  Classify  New 
Images  -  US  Weather  Pattern 

To classify a new image, select the image and load it in. Below, we show the
classification of several alternative images (Fig. 23.30).

Fig. 23.30 A U.S. weather pattern map as an example image for neural network image recognition


library("imager")

# One should be able to load the image directly from the web (but sometimes
# there may be problems, in which case we need to first download the image
# and then load it in R):
# im <- imager::load.image("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png")

# download file to local working directory, use "wb" mode to avoid problems
download.file("http://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png",
    paste(getwd(), "results/image.png", sep="/"), mode = 'wb')

# report download image path
paste(getwd(), "results/image.png", sep="/")

img <- load.image(paste(getwd(), "results/image.png", sep="/"))
dim(img)

## [1] 1875 1084 1 4

plot(img)

Before  feeding  the  image  to  the  deep  learning  network  for  classification,  we  need 
to  do  some  preprocessing  to  make  it  fit  the  deepnet  input  requirements.  This  image 
preprocessing  (cropping  and  subtraction  of  the  mean)  can  be  done  directly  in  R. 

preproc.image <- function(im, mean.image) {
  # crop the image
  shape <- dim(im)
  short.edge <- min(shape[1:2])
  xx <- floor((shape[1] - short.edge) / 2)
  yy <- floor((shape[2] - short.edge) / 2)
  cropped <- crop.borders(im, xx, yy)

  # resize to 224 x 224, needed by input of the model.
  resized <- resize(cropped, 224, 224)
  plot(resized)

  # convert to array (x, y, channel)
  arr <- as.array(resized[, , , c(1:3)]) * 255
  plot(as.cimg(arr))
  dim(arr) <- c(224, 224, 3)

  # subtract the mean (use the function argument, not a global object)
  normed <- arr - mean.image

  # Reshape to format needed by mxnet (width, height, channel, num)
  dim(normed) <- c(224, 224, 3, 1)
  return(normed)
}

Fig. 23.31 Normalized US weather pattern map image


Call the preprocessing function to obtain the normalized image (Fig. 23.31).

normed <- preproc.image(img, mean.img)
# plot(normed)
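
The cropping arithmetic inside the preprocessing function can be checked by hand for the weather-map image, whose reported width and height are 1875 and 1084 pixels. The base R sketch below assumes that crop.borders trims the given number of pixels from each side of the image; the variable names are illustrative.

```r
# Width and height of the weather-map image reported by dim(img) above.
shape <- c(1875, 1084)

short_edge <- min(shape)
xx <- floor((shape[1] - short_edge) / 2)   # columns trimmed from each side
yy <- floor((shape[2] - short_edge) / 2)   # rows trimmed from each side

cropped_shape <- shape - 2 * c(xx, yy)     # size after trimming both sides
c(xx = xx, yy = yy)
cropped_shape
```

Under this assumption, the cropped image is nearly (but, because of the floor operation, not exactly) square; the subsequent resize to 224 x 224 removes the remaining one-pixel difference.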

The image classification uses a predict function to get the probability over all
(learned) classes.

prob <- predict(model, X=normed)
dim(prob)

## [1] 1000 1

The prob prediction generates a 1,000 x 1 array representing the probabilities that
the input image belongs to (is classified as) each of the 1,000 known image categories.
We can report the indices of the top-10 closest image classes to the input image:

max.idx <- order(prob[, 1], decreasing = TRUE)[1:10]
max.idx

## [1] 855 563 229 581 620 948 951 186 204 311

Alternatively, we can map these top-10 indices into named image-classes.

synsets <- readLines("synset.txt")
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
    "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n04418357 theater curtain, theatre curtain; Probability: 0.0493971668183804"
## [2] "Top Predicted Image-Label Classes: Name=n03388043 fountain; Probability: 0.0431815795600414"
## [3] "Top Predicted Image-Label Classes: Name=n02105505 komondor; Probability: 0.0371582210063934"
## [4] "Top Predicted Image-Label Classes: Name=n03457902 greenhouse, nursery, glasshouse; Probability: 0.0368415862321854"
## [5] "Top Predicted Image-Label Classes: Name=n03637318 lampshade, lamp shade; Probability: 0.0317880213260651"
## [6] "Top Predicted Image-Label Classes: Name=n07734744 mushroom; Probability: 0.0292572267353535"
## [7] "Top Predicted Image-Label Classes: Name=n07747607 orange; Probability: 0.0284675862640142"
## [8] "Top Predicted Image-Label Classes: Name=n02094114 Norfolk terrier; Probability: 0.026896309107542"
## [9] "Top Predicted Image-Label Classes: Name=n02098286 West Highland white terrier; Probability: 0.0257413759827614"
## [10] "Top Predicted Image-Label Classes: Name=n02219486 ant, emmet, pismire; Probability: 0.0205500852316618"
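
The order-based ranking used above works on any probability vector. Here is a minimal base R sketch with four made-up classes; toy_prob and toy_names are hypothetical stand-ins for prob and synsets, and the values are invented for illustration.

```r
# Hypothetical class probabilities and names (stand-ins for prob and synsets).
toy_prob <- c(0.05, 0.60, 0.10, 0.25)
toy_names <- c("fountain", "volcano", "geyser", "alp")

k <- 2
top_idx <- order(toy_prob, decreasing = TRUE)[1:k]   # indices of the k largest

data.frame(class = toy_names[top_idx], probability = toy_prob[top_idx])
```

Indexing both the names and the probabilities with the same index vector keeps each label aligned with its probability.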

Clearly, this U.S. weather pattern image is not well classified. The top-ranked
prediction suggests this may be a theater curtain; however, the confidence is very
low, Prob ~ 0.049. None of the other top-10 classes captures the type of the actual
image either.

The machine learning image classification results won't always be this poor.
Let's try classifying several alternative images.
23.6.3 Lake Mapourika, New Zealand

Let’s  try  the  automated  image  classification  of  this  lakeside  panorama  (Figs.  23.32 
and  23.33). 

download.file("https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg",
    paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)


Fig. 23.32 A lakeside panorama image for neural network image recognition

Fig. 23.33 Normalized lakeside panorama image

normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[, 1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
    "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty; Probability: 0.648901104927063"
## [2] "Top Predicted Image-Label Classes: Name=n03216828 dock, dockage, docking facility; Probability: 0.183006703853607"
## [3] "Top Predicted Image-Label Classes: Name=n09332890 lakeside, lakeshore; Probability: 0.127718329429626"
## [4] "Top Predicted Image-Label Classes: Name=n03160309 dam, dike, dyke; Probability: 0.0115784741938114"
## [5] "Top Predicted Image-Label Classes: Name=n03095699 container ship, containership, container vessel; Probability: 0.00913785584270954"
## [6] "Top Predicted Image-Label Classes: Name=n09428293 seashore, coast, seacoast, sea-coast; Probability: 0.0043862983584404"
## [7] "Top Predicted Image-Label Classes: Name=n03933933 pier; Probability: 0.00410780590027571"
## [8] "Top Predicted Image-Label Classes: Name=n02859443 boathouse; Probability: 0.00246214028447866"
## [9] "Top Predicted Image-Label Classes: Name=n09399592 promontory, headland, head, foreland; Probability: 0.00168424111325294"
## [10] "Top Predicted Image-Label Classes: Name=n09421951 sandbar, sand bar; Probability: 0.00106814480386674"

This  photo  does  represent  a  lakeside,  which  is  reflected  by  the  top  three  class 
labels: 

•  Breakwater,  groin,  groyne,  mole,  bulwark,  seawall,  jetty. 

•  Dock,  dockage,  docking  facility. 

•  Lakeside,  lakeshore. 


23.6.4  Beach  Image 

Another coastal boundary between water and land is represented in this beach image
(Fig. 23.34).

download.file("https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg",
    paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)

Fig. 23.34 A beach image for neural network image recognition

normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[, 1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
    "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n09421951 sandbar, sand bar; Probability: 0.69039398431778"
## [2] "Top Predicted Image-Label Classes: Name=n09332890 lakeside, lakeshore; Probability: 0.20282569527626"
## [3] "Top Predicted Image-Label Classes: Name=n09428293 seashore, coast, seacoast, sea-coast; Probability: 0.0899285301566124"
## [4] "Top Predicted Image-Label Classes: Name=n02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty; Probability: 0.006692836"
## [5] "Top Predicted Image-Label Classes: Name=n09399592 promontory, headland, head, foreland; Probability: 0.00204332848079503"
## [6] "Top Predicted Image-Label Classes: Name=n02859443 boathouse; Probability: 0.00106108584441245"
## [7] "Top Predicted Image-Label Classes: Name=n02951358 canoe; Probability: 0.000664844119455665"
## [8] "Top Predicted Image-Label Classes: Name=n09246464 cliff, drop, drop-off; Probability: 0.000416322873206809"
## [9] "Top Predicted Image-Label Classes: Name=n04357314 sunscreen, sunblock, sun blocker; Probability: 0.000338666519382969"
## [10] "Top Predicted Image-Label Classes: Name=n04606251 wreck; Probability: 0.000292503653327003"

This  photo  was  classified  appropriately  and  with  high-confidence  as: 

•  Sandbar,  sand  bar. 

•  Lakeside,  lakeshore. 

•  Seashore,  coast,  seacoast,  sea-coast. 


23.6.5  Volcano 


Here is another natural image representing the Mount St. Helens Volcano
(Fig. 23.35).

Fig. 23.35 A volcano image for neural network image recognition

download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg",
    paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)

# preprocess the new image before predicting
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[, 1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
    "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n09472597 volcano; Probability: 0.993182718753815"
## [2] "Top Predicted Image-Label Classes: Name=n09288635 geyser; Probability: 0.00681292032822967"
## [3] "Top Predicted Image-Label Classes: Name=n09193705 alp; Probability: 4.15803697251249e-06"
## [4] "Top Predicted Image-Label Classes: Name=n03344393 fireboat; Probability: 1.48333114680099e-07"
## [5] "Top Predicted Image-Label Classes: Name=n04310018 steam locomotive; Probability: 1.17537313215621e-08"
## [6] "Top Predicted Image-Label Classes: Name=n03388043 fountain; Probability: 7.4444117537098e-09"
## [7] "Top Predicted Image-Label Classes: Name=n04228054 ski; Probability: 2.90055357510255e-09"
## [8] "Top Predicted Image-Label Classes: Name=n02950826 cannon; Probability: 2.27150032117152e-09"
## [9] "Top Predicted Image-Label Classes: Name=n03773504 missile; Probability: 1.69992575571598e-09"
## [10] "Top Predicted Image-Label Classes: Name=n04613696 yurt; Probability: 1.25635490899612e-09"


The top predicted class label for this image (volcano, probability ~ 0.993) is correct. The top three labels are:

•  Volcano. 

•  Geyser. 

•  Alp. 


23.6.6 Brain Surface

Fig. 23.36 A cortical brain surface image for neural network image recognition


The  next  image  represents  a  2D  snapshot  of  3D  shape  reconstruction  of  a  brain  cortical 
surface.  This  image  is  particularly  difficult  to  automatically  classify  because  (1)  few 
people  have  ever  seen  a  real  brain,  (2)  the  mathematical  and  computational  models  used 
to  obtain  the  2D  manifold  representing  the  brain  surface  do  vary,  and  (3)  the  patterns  of 
sulcal  folds  and  gyral  crests  are  quite  inconsistent  between  people  (Fig.  23.36). 


download.file("http://wiki.socr.umich.edu/images/e/ea/BrainCortex2.png",
    paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)

# preprocess the new image before predicting
normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[, 1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
    "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n01917289 brain coral; Probability: 0.4974305331707"
## [2] "Top Predicted Image-Label Classes: Name=n07734744 mushroom; Probability: 0.229991897940636"
## [3] "Top Predicted Image-Label Classes: Name=n13052670 hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa; Probability: 0.0925175696611404"
## [4] "Top Predicted Image-Label Classes: Name=n03598930 jigsaw puzzle; Probability: 0.0433991812169552"
## [5] "Top Predicted Image-Label Classes: Name=n07718747 artichoke, globe artichoke; Probability: 0.0150045640766621"
## [6] "Top Predicted Image-Label Classes: Name=n07860988 dough; Probability: 0.0124379806220531"
## [7] "Top Predicted Image-Label Classes: Name=n07715103 cauliflower; Probability: 0.0115451859310269"
## [8] "Top Predicted Image-Label Classes: Name=n12985857 coral fungus; Probability: 0.0109992604702711"
## [9] "Top Predicted Image-Label Classes: Name=n07714990 broccoli; Probability: 0.00909161567687988"
## [10] "Top Predicted Image-Label Classes: Name=n03637318 lampshade, lamp shade; Probability: 0.00754355266690254"

The top class labels for the brain image are:

•  Brain  coral. 

•  Mushroom. 

•  Hen-of-the-woods,  hen  of  the  woods,  Polyporus  frondosus,  Grifola  frondosa. 

•  Jigsaw  puzzle. 

Imagine if we could train a brain image classifier that labels individuals (volunteers
or patients) solely based on their brain scans into different classes reflecting their
developmental state, clinical phenotypes, disease traits, or aging profiles. This would
require a substantial amount of expert-labeled brain scans, intense model training,
and extensive validation. However, progress in this direction could lead to
effective computational clinical decision support systems that can assist physicians
with diagnosis, tracking, and prognostication of brain growth and aging in health and
disease.


23.6.7  Face  Mask 

The  last  example  is  a  synthetic  computer-generated  image  representing  a  cartoon 
face  or  a  mask  (Fig.  23.37). 

download.file("http://wiki.socr.umich.edu/images/f/fb/FaceMask1.png",
              paste(getwd(), "results/image.png", sep="/"), mode = 'wb')
im <- load.image(paste(getwd(), "results/image.png", sep="/"))
plot(im)


Fig. 23.37 A facial mask image for neural network image recognition

prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:10]
print(paste0("Top Predicted Image-Label Classes: Name=", synsets[max.idx],
             "; Probability: ", prob[max.idx]))

## [1] "Top Predicted Image-Label Classes: Name=n03724870 mask; Probability: 0.376201003789902"
## [2] "Top Predicted Image-Label Classes: Name=n04229816 ski mask; Probability: 0.253164798021317"
## [3] "Top Predicted Image-Label Classes: Name=n02708093 analog clock; Probability: 0.0562068484723568"
## [4] "Top Predicted Image-Label Classes: Name=n02865351 bolo tie, bolo, bola tie, bola; Probability: 0.029578423127532"
## [5] "Top Predicted Image-Label Classes: Name=n04192698 shield, buckler; Probability: 0.0278499200940132"
## [6] "Top Predicted Image-Label Classes: Name=n03590841 jack-o'-lantern; Probability: 0.0175030305981636"
## [7] "Top Predicted Image-Label Classes: Name=n02974003 car wheel; Probability: 0.0172393135726452"
## [8] "Top Predicted Image-Label Classes: Name=n07892512 red wine; Probability: 0.0168519839644432"
## [9] "Top Predicted Image-Label Classes: Name=n03249569 drum, membranophone, tympan; Probability: 0.0141900414600968"
## [10] "Top Predicted Image-Label Classes: Name=n04447861 toilet seat; Probability: 0.013601747341454"

The  top  class  labels  for  the  face  mask  are: 

•  Mask. 

•  Ski  mask. 

•  Analog  clock. 

You can easily test the same image classifier on your own images and identify
classes of pictures that the deep learning model classifies well or poorly.
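
The ranking step used throughout these examples (sorting the class probabilities and
reporting the largest ones) can be isolated in a small helper. The following is a minimal
sketch in base R. The function name top_k_labels and the toy probabilities and labels
are illustrative assumptions; they are not part of the pre-trained model or of MXNet.

```r
# Hypothetical helper: report the k most probable class labels.
# prob   - numeric vector of class probabilities (one value per class)
# labels - character vector of class names, same length as prob
top_k_labels <- function(prob, labels, k = 10) {
  stopifnot(length(prob) == length(labels))
  k <- min(k, length(prob))                      # never ask for more classes than exist
  idx <- order(prob, decreasing = TRUE)[seq_len(k)]
  data.frame(label = labels[idx], probability = prob[idx],
             stringsAsFactors = FALSE)
}

# Toy example (made-up values, for illustration only)
toy_prob   <- c(0.05, 0.60, 0.25, 0.10)
toy_labels <- c("analog clock", "mask", "ski mask", "shield")
top3 <- top_k_labels(toy_prob, toy_labels, k = 3)
print(top3)
```

With a real model, prob would be one column of the matrix returned by predict() and
labels would be the synsets vector, as in the code above.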


23.7  Assignment:  23.  Deep  Learning,  Neural  Networks 
23.7.1  Deep  Learning  Classification 

•  Download  the  Alzheimer’s  data  from  the  SOCR  Archive. 

•  Properly  preprocess  the  data  and  remove  outliers. 

•  Build  a  multi-layer  perceptron  as  a  classifier  and  select  proper  parameters. 

•  Classify  AD  and  NC  and  report  the  detailed  classification  accuracy  metrics  using 
cross  table,  accuracy,  sensitivity,  specificity,  LOR,  AUC. 

•  Generate  some  data/results  visualizations,  at  least  include  histograms  and  model 
graph  structures.  See  Chap.  23. 

•  Try  to  construct  a  deeper  and  more  elaborate  network  model  and  report  the 
prediction  results. 

•  Compare  your  results  with  alternative  data-driven  methods  (e.g.,  KNN). 
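
The accuracy metrics requested above can be computed directly from the cross table of
predicted and actual labels. The sketch below uses base R only and assumes that LOR
denotes the log odds ratio, log((TP*TN)/(FP*FN)). The helper name binary_metrics and the
toy labels are hypothetical; they are not the SOCR Alzheimer's data.

```r
# Hypothetical helper: summary metrics for a binary classifier.
# truth, pred - vectors of class labels; positive - the label treated as "disease"
binary_metrics <- function(truth, pred, positive) {
  tp <- sum(pred == positive & truth == positive)   # true positives
  tn <- sum(pred != positive & truth != positive)   # true negatives
  fp <- sum(pred == positive & truth != positive)   # false positives
  fn <- sum(pred != positive & truth == positive)   # false negatives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    LOR         = log((tp * tn) / (fp * fn)))       # infinite if fp or fn is 0
}

# Toy labels (made up for illustration)
truth <- c("AD", "AD", "AD", "AD", "NC", "NC", "NC", "NC", "NC", "NC")
pred  <- c("AD", "AD", "AD", "NC", "NC", "NC", "NC", "NC", "AD", "NC")
print(table(predicted = pred, actual = truth))      # cross table
m <- binary_metrics(truth, pred, positive = "AD")
print(round(m, 3))
```

AUC is not covered by this sketch because it requires predicted class probabilities
rather than hard labels.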

23.7.2 Deep Learning Regression

•  Download  the  Allometric  relationship  data  from  SOCR  data. 

•  Preprocess  the  data  and  set  density  as  the  response  variable. 

•  Create  an  MXNet  feedforward  neural  net  model  and  properly  specify  the 
parameters. 

•  Train  a  model,  predict,  and  report  RMSE  on  the  test  data,  evaluate  the  result,  and 
justify  your  evaluation. 

•  Output  the  model’s  structure. 
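
Reporting RMSE on held-out test data follows the same pattern for any regression model.
The sketch below is only a stand-in under stated assumptions: it uses the built-in mtcars
data and an ordinary linear model (lm) instead of the Allometric data and an MXNet
network, and the helper name rmse is hypothetical. Comparing against a naive baseline is
one simple way to justify an evaluation.

```r
# Root-mean-square error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Stand-in data and model (assumptions): built-in mtcars, linear regression
set.seed(1234)
test_idx  <- sample(nrow(mtcars), size = 8)     # hold out 8 of the 32 cars
train     <- mtcars[-test_idx, ]
test      <- mtcars[test_idx, ]
fit       <- lm(mpg ~ wt + hp, data = train)
test_rmse <- rmse(test$mpg, predict(fit, newdata = test))

# Naive baseline: always predict the mean response of the training set
base_rmse <- rmse(test$mpg, mean(train$mpg))
print(c(model = test_rmse, baseline = base_rmse))
```

A model whose test RMSE is not clearly below the baseline RMSE has learned little from
the predictors.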


23.7.3  Image  Classification 

Apply  the  deep  learning  neural  network  techniques  to  classify  some  images  using  the 
pre-trained  model  as  demonstrated  in  this  chapter: 

•  Google  images. 

•  SOCR  Neuroimaging  data. 

•  Your  own  images. 
Summary


The amount, complexity, and speed of aggregation of biomedical and healthcare
data will rapidly increase over the next decade. It’s likely to double every 1-2 years.
This is fueled by enormous strides in digital and communication technologies, IoT
devices, and Cloud services, as well as rapid algorithmic, computational, and hardware
advances. The proliferating public demand for (near) real-time detection,
precise interpretation, and reliable prognostication of human conditions in health
and disease also accelerates that trend.

The future does look promising despite the law of diminishing returns, which
dictates that sustaining the trajectory of clinical gains and the speed of breakthrough
developments derived from this increased volume of information, paired with our
ability to interpret it, will demand increasingly more resources. Even incremental
advances, partial solutions, or lower rates of progress will likely lead to substantive
improvements in many human experiences and enhanced medical treatments.
Figure 1 illustrates a common predictive analytics protocol for interrogating
big and complex biomedical and health datasets. The process starts by identifying a
challenge, followed by determining the sources of data and meta-data; cleaning,
harmonizing, and wrangling the data components; preprocessing the aggregated
archive; and model-based and model-free scientific inference. It ends with prediction,
validation, and dissemination of data, software, protocols, and research findings.

Our long-term success will require major headway on multiple fronts of data
science and predictive analytics. There are urgent demands to develop new algorithms
and optimize existing ones, introduce novel computational infrastructure, and
enhance the abilities of the workforce by overhauling education and training
activities. Data science and predictive analytics represents a new and transdisciplinary
field, where engagement of heterogeneous experts, multi-talented teamwork,
and open-science collaborations will be of paramount importance.

The DSPA textbook attempts to lay the foundation for some of the techniques,
strategies, and approaches driving contemporary analytics involving Big Data (large
size, complex formats, incomplete observations, incongruent features, multiple
sources, and multiple scales). It includes some of the mathematical formalisms,
computational algorithms, machine learning procedures, and demonstrations for Big
Data visualization, simulation, mining, pattern identification, forecasting, and
interpretation.

Fig. 1 Major steps in a general predictive data analytics protocol

This textbook (1) contains a transdisciplinary treatise of predictive health analytics;
(2) provides a complete and self-contained treatment of the theory, experimental
modeling, system development, and validation of predictive health analytics;
(3) includes unique case-studies, advanced scientific concepts, lightweight tools,
web demos, and end-to-end workflow protocols that can be used to learn, practice,
and apply to new challenges; and (4) includes unique interactive content supported
by the active community of over 100,000 R-developers. These techniques can be
translated to many other disciplines (e.g., social network and sentiment analysis,
environmental applications, operations research, and manufacturing engineering).

The  following  two  examples  may  contextually  explain  the  need  for  inventive 
data-driven  science,  computational  abilities,  interdisciplinary  expertise,  and  modern 
technologies  necessary  to  achieve  desired  outcomes,  like  improving  human  health, 
or  optimizing  future  returns  on  investment.  These  aims  can  only  be  accomplished  by 
experienced  teams  of  researchers  who  can  develop  robust  decision  support  systems 
using  modern  techniques  and  protocols,  like  the  ones  described  in  this  textbook. 

•  A geriatric neurologist is examining a patient complaining of gait imbalance and
postural instability. To determine if the patient may have Parkinson’s disease, the
physician acquires clinical, cognitive, phenotypic, imaging, and genetics data
(Big Healthcare Data). Currently, most clinics and healthcare centers are not
equipped with skilled data analysts who can wrangle, harmonize, and interpret
such complex datasets, nor do they have access to normative population-wide
summaries. A reader who completes the DSPA course of study will have the basic
competency and ability to manage the data, generate a protocol for deriving
candidate biomarkers, and provide an actionable decision support system. This
protocol will help the physician understand the patient’s health holistically and
make a comprehensive evidence-based clinical diagnosis as well as provide a
data-driven prognosis.

•  To improve the return on investment for their shareholders, a healthcare
manufacturer needs to forecast the demand for their new product based on observed
environmental, demographic, market conditions, and bio-social sentiment data.
This clearly represents another example of Big Biosocial Data. The organization’s
data-analytics team is tasked with building a workflow that identifies, aggregates,
harmonizes, models, and analyzes all available data elements to generate a trend
forecast. This system needs to provide an automated, adaptive, scalable, and
reliable prediction of the optimal investment and R&D allocation that maximizes
the company’s bottom line. Readers who complete the materials in the DSPA
textbook will be able to ingest the observed structured and unstructured data,
mathematically represent the data as a unified computable object, and apply
appropriate model-based and model-free prediction techniques to forecast the expected
relation between the company’s investment, product manufacturing costs, and
the general healthcare demand for this product by patients and healthcare
service providers. Applying this protocol to pilot data collected by the company
will result in valuable predictions quantifying the interrelations between costs and
benefits, supply and demand, as well as consumer sentiment and health outcomes.

The DSPA materials (book chapters, code and scripts, data, case studies, electronic
materials, and web demos) may be used as a reference or as a retraining or
refresher guide. These resources may be useful for formal education and informal
training, as well as for health informatics, biomedical data analytics, biosocial
computation courses, or MOOCs. Although the textbook is intended for
one- or two-semester graduate-level courses, readers, trainees, and instructors
should review the early sections of the textbook for utilization strategies and
explore the suggested completion pathways.

As acknowledged in the front matter, this textbook relies on the enormous
contributions and efforts by a broad community, including researchers, developers,
students, clinicians, bioinformaticians, data scientists, open-science investigators,
and funding organizations. The author strongly encourages all DSPA readers,
educators, and practitioners to actively contribute to data science and predictive
analytics, share data, algorithms, code, protocols, services, successes, failures,
pipeline workflows, research findings, and learning modules. Corrections, suggestions
for improvements, enhancements, and expansions of the DSPA materials are
always welcome and may be incorporated in electronic updates, errata, and revised
editions with appropriate credits.
Table 1 Glossary of terms and abbreviations used in the textbook

Notation                  Description
ADNI                      Alzheimer’s Disease Neuroimaging Initiative
AD                        Alzheimer’s Disease patients
Allometric relationship   Relationship of body size to shape, anatomy, physiology and behavior
ALS                       Amyotrophic lateral sclerosis
API                       Application program interface
Apriori                   Apriori Association Rules Learning (Machine Learning) Algorithm
ARIMA                     Time-series autoregressive integrated moving average model
array                     Arrays are R data objects used to represent data in more than two dimensions
BD                        Big Data
cor                       correlation
CV                        Cross Validation (an internal statistical validation of a prediction,
                          classification or forecasting method)
DL                        Deep Learning
DSPA                      Data Science and Predictive Analytics
Eigen                     Referring to the general eigen-spectra, eigen-value, eigen-vector,
                          eigen-function
FA                        Factor analysis
GPU or CPU                Graphics or Central Processing Unit (computer chipset)
GUI                       graphical user interface
HHMI                      Howard Hughes Medical Institute
I/O                       Input/Output
IDF                       inverse document frequency
IoT                       Internet of Things
JSON                      JavaScript Object Notation
k-MC                      k-Means Clustering
lm()                      linear model
lowess                    locally weighted scatterplot smoothing
LP or QP                  linear or quadratic programming
MCI                       mildly cognitively impaired patients
MIDAS                     Michigan Institute for Data Science
ML                        Machine-Learning
MOOC                      massive open online course
MXNet                     Deep Learning technique using R package MXNet
NAND                      Negative-AND logical operator
NC or HC                  Normal (or Healthy) control subjects
NGS                       Next Generation Sequence (Analysis)
NLP                       Natural Language Processing
OCR                       optical character recognition
PCA                       Principal Component Analysis
PD                        Parkinson’s Disease patients
PPMI                      Parkinson’s Progression Markers Initiative
(R)AWS                    (Risk for) Alcohol Withdrawal Syndrome
RMSE                      root-mean-square error
SEM                       structural equation modeling
SOCR                      Statistics Online Computational Resource
SQL                       Structured Query Language (for database queries)
SVD                       Singular value decomposition
SVM                       Support Vector Machines
TM                        Text Mining
TS                        Time-series
w.r.t.                    With Respect To, e.g., “Take the derivative of this expression w.r.t. $a_1$
                          and set the derivative to 0, which yields $(S - \lambda I_N) a_1 = 0$.”
XLSX                      Microsoft Excel Open XML Format Spreadsheet file
XML                       eXtensible Markup Language
XOR                       Exclusive OR logical operator


Index 


A 

Accuracy,  10,  211,  275,  276,  283,  301-303, 
307,  323-325,  334,  335,  337,  339,  340, 
342,  343,  377,  409,  424,  432,  463,  475, 
479-482,  484,  485,  497,  500,  502,  504, 
507,  508,  511,  561,  562,  573,  576,  583, 
599,  605,  692,  698,  704,  726,  767,  781, 
782,  784,  793,  800,  801,  806 
Activation,  383-385,  403,  767-769,  774,  775, 
781,  785,  799,  800 

Activation  functions,  384,  385,  767,  781 
add,  16,  22,  24,  33,  41,  146,  155,  158,  159, 
162,  225,  227,  230,  292,  332,  373, 

386,  391,  402,  403,  418,  424,  454, 

479,  530,  538,  595,  605,  633,  645, 

712, 801 

Alcohol  withdrawal  syndrome  (RAWS),  3,  824 
Allometric,  266,  817,  823 
Allometric  relationship,  817 
ALSFRS,  4,  559,  733,  783 
Alzheimer’s  disease  (AD),  4,  149-151, 

569,  823 

Alzheimer’s  disease  neuroimaging  initiative 
(ADNI),  4,  823 

Amyotrophic  lateral  sclerosis  (ALS),  4,  140, 
141,  559-569,  733,  783-784,  823 
Analog  clock,  816 
Appendix,  56-60,  138-139,  149, 

183-197,420 

Application  program  interface  (API),  525, 

784. 823 

Apriori,  267,  268,  423-427,  431,  441, 

472. 823 

ARIMA,  623,  626,  628,  630-638,  823 


array,  20,  25,  31-33,  145 
array  (),  18 

Assessment,  282-286,  510-511 
Assessment:  22.  deep  learning,  neural 
networks,  816-817 
assocplot,  40 

assocplot(x)  Cohen’s  Friendly  graph  shows  the 
deviations  from  independence  of  rows 
and  columns  in  a  two  dimensional 
contingency  table,  40 

attr,  27 

Attributes,  26,  27, 144,  289,  311,  313,  315,  342, 
530,  560,  561,  670 

axes,  41,  46,  47,  131,  152,  154,  159,  171,  191, 
219,  249,  258,  261,  368,  595,  648 
axes=TRUE,  41 


B 

Bar,  15,  140,  143,  147,  159,  161, 

162,  164 

barplot,  39,  161,  162,  164,  463 
barplot(x)  histogram  of  the  values  of  x.  Use 
horiz=FALSE  for  horizontal  bars,  39 
Beach,  811-812 

Big  Data,  1,  4,  8-10,  12,  642,  661,  765, 

819, 823 
Biomedical,  8-9 

Bivariate,  39,  40,  46,  77,  140,  153-156,  173, 
238,  240,  252,  738-739,  766,  770 
Black  box,  383,  766 
boxplot,  39,  70,  161 
boxplot(x)  ‘box-and- whiskers’  plot,  39 
Brain,  4,  178,  286,  511,  769,  814-815 
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C 

c(),  18-20,  552 

c  (),  seq  (),  rep  (),  and  data.frame  ().  Sometimes 
we  use  list  ()  and  array  ()  to  create  data 
too,  18 
C/C++,  13 

Cancer,  293,  294,  296,  298,  302,  303, 

424,  427, 432 

Caret,  322,  477,  486,  487,  491,  492,  497-510, 
554,  555,  564,  776 

Chapter,  13,  63,  69,  139,  143,  149,  164,  183, 
201,  222,  245,  268,  271,  274,  289,  295, 
298,  300,  301,  308,  317,  322,  329,  334, 
336,  337,  342,  345-348,  353,  358,  361, 
370,  373,  380,  383,  390,  392,  394, 

398,  401,  409,  414-416,  420,  427,  442, 
447-449,  465,  475-480,  488,  491,  492, 
494,  527,  546,  553,  554,  557,  563,  564, 
570,  573,  574,  585,  592,  599,  601,  623, 
657,  659,  672,  674,  684,  689,  695,  697, 
712,  713,  715,  717,  719,  720,  723,  727, 
733,  735,  736,  738,  749,  753,  756,  763, 
766,  795,817 
Chapter  22,  415,  817 
Chapter  23,  164 

Chronic  disease,  316,  330,  335,  383, 

416,  476,  503 

Classification,  144,  267,  268,  281,  286-287, 
289,  304-305,  307,  323,  331-332, 
396-403,  477,  478,  498,  510,  533, 
773-782,  795-805,  816 
Clinical,  258,  612,  614,  695 
Coast,  812 

Cognitive,  2,  4,  7,  149,  700,  820 
Color,  45,  46,  87,  132,  151,  154,  165,  167,  172, 
269,  444,  649,  660 

confusionMatrix,  283,  322,  477,  480,  482,  485, 
776, 787 

Constrained,  244,  587,  735,  740-747,  750 
Contingency  table,  35,  40,  78,  500 
contour,  40 

contour(x,  y,  z)  contour  plot  (data  are 

interpolated  to  draw  the  curves),  x  and  y 
must  be  vectors  and  z  must  be  a  matrix 
so  that  dim(z)=c(length(x),  length(y)) 

(x  and  y  may  be  omitted),  40 
coplot,  40 

coplot(x~y  I  z)  bivariate  plot  of  x  and  y  for  each 
value  or  interval  of  values  of  z,  40 
Coral,  815 

Cosine,  659,  685,  695 
Cosine  similarity,  695 


Cost  function,  217,  503,  573,  586,  703,  735, 
743,  747,  757,  758 

CPU,  553,  765,  775,  782,  800,  804,  805,  823 
Create,  19,  22,  76-78,  83,  132,  174,  202,  214, 
222,  224,  273,  274,  299,  315,  318,  319, 
370,  380,  383,  390,  450,  461,  489,  491, 
504,  538,  607,  630,  638,  644,  645,  647, 
661,  674,  688,  717,  775,  781 
Cross val,  776,  787 
Cross  validation,  477,  599-601, 

733-734,  823 


D 

Data  frame,  19,  21,  22,  24,  28,  29,  31,  33-36, 
39,  40,  47,  48,  66,  131,  132,  153, 

164,  172,  174,  273,  274,  299,  300, 

319,  438,  451,  490,  514,  526,  529, 

537,  540,  547-549,  555,  561,  562, 

565, 608 

data.frame,  19,  25,  83,  103,  164,  273 
Data  science,  1,  9,  11,  661,  823,  824 
Data  Science  and  Predictive  Analytics  (DSPA), 
1,  11-13,  198,  492,  623,  661, 

819-821,  823 

Decision  tree,  307,  310-316,  498,  510,  533 
Deep  learning,  765-768,  816-817,  823,  824 
classification,  816 
regression,  817 

Denoising,  735,  756,  757,  760,  763 
Density,  46,  48,  49,  72,  98,  132,  133,  140,  141, 
143-147,  173,  174,  198,  287,  289 
Device,  775,  800 
diagnosticErrors,  718,  776 
Dichotomous,  40,  271,  318,  459,  460,  478,  655, 
698,  733,  746,  747,  770 
Dimensionality  reduction,  233,  265-266 
Divide-and-conquer,  307,  311,  373 
Divide  and  conquer  classification,  307 
Divorce,  443,  448-455,  467,  470 
dotchart,  39 

dotchart(x)  if  x  is  a  data  frame,  plots  a 

Cleveland  dot  plot  (stacked  plots  line- 
by-line  and  column-by-column),  39 
Download,  15,  555,  806,  817 


E 

Earthquake,  132-135,  157,  159,  172 
Ebola,  5 
Eigen,  219,  823 
Entropy,  311-313,  342 
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Error,  28,  47,  57,  60,  162,  163,  217,  254,  258, 
270,  280,  281,  287,  302,  305,  311,  313, 
316,  321,  324-325,  328,  329,  331,  332, 
350,  361,  378,  388,  391,  393,  412, 
478-480,  487,  491,  500,  501,  504,  507, 
509,  562,  565,  573,  576,  579,  582-584, 
586,  587,  599,  618,  640,  645,  648,  697, 
701-703,  712,  714,  725,  733,  734, 

784, 824 

Evaluation,  268,  282,  322,  335,  361,  443,  451, 
475,  477,  491,  492,  501,  504,  507,  510, 
543,546,  554,  697,  703,817 
Exome,  6 

Expectations,  11-12 
Explanations,  41,  510 

F 

Face,  815-816 

Factor,  21,  24,  46,  79,  210,  219,  233,  255,  256, 
259,  265,  287,  292,  294,  299,  319,  333, 
352,  359,  412,  417,  438,  561,  570,  575, 
588,  600,  608,  630,  638-640,  644,  676, 
677,  703,  725 

Factor  analysis  (FA),  233,  242,  243,  254-256, 
262,  265,  638,  639,  644,  823 
False-negative,  700 
False-positive,  325,  573,  574,  619 
Feature  selection,  557-559,  571-572 
Feedforward  neural  net,  817 
filled.contour(x,  y,  z)  areas  between  the 

contours  are  colored,  and  a  legend  of  the 
colors  is  drawn  as  well,  40 
Flowers,  39,  63,  309,  383,  410,  411,  414,  510 
Format,  13,  17-18,  22,  36,  38,  427,  513-515, 
522,  524,  525,  529,  537,  553,  665, 799, 
801, 805 

Foundations,  13,  638-641 
fourfoldplot(x)  visualizes,  with  quarters  of 
circles,  the  association  between  two 
dichotomous  variables  for  different 
populations  (x  must  be  an  array  with 
dim=c(2,  2,  k),  or  a  matrix  with  dim=c 
(2,  2)  if  k  =  1),  40 

Frequencies,  29,  39,  46,  145,  193,  298,  429, 
430,  439,  463,  484,  485,  667,  672,  685 
Function,  2,  4,  16,  20,  22,  28,  30,  32-35,  37, 
47-50,  57-60,  66,  68-70,  76-78,  83, 
131-133,  143,  145,  148,  149,  151,  153, 
155,  157,  161,  162,  167, 172-175, 187, 
202,  207,  208,  213,  216-219,  222,  224, 
225,  234,  243,  246,  247,  251,  254,  255, 
257,  260,  267,  269,  272-274,  289,  295, 


299,  300,  308,  313,  314,  317,  319,  322, 
323,  332,  334,  337,  351,  352,  356,  358, 
361,  370,  375,  376,  378,  383-385, 
390-392,  394-397,  401-403,  411-413, 
427,  428,  432,  434,  438,  449-451,  455, 
470,  475,  479,  480,  483,  490,  494, 
499-501,  504-506,  508,  509,  514,  524, 
526,  530,  532,  542,  547-554,  560,  561, 
563,  569,  575,  579,  582,  586,  595,  600, 
602,  607,  616,  625,  631,  632,  634,  637, 
640,  644,  645,  649,  655,  660,  664-667, 
673-676,  688,  702,  709,  713,  714,  716, 
717,  735-741,  748,  749,  753,  767-770, 
772,  774-776,  781,  782,  785,  799-801, 
808,  823 

Functional  magnetic  resonance  imaging 
(fMRI),  178-181,  623,  657 

Function  optimization,  243,  735,  761-763 


G 

Gaussian  mixture  modeling,  443 
Generalized  estimating  equations  (GEE), 
653-657 

Geyser,  174,  175,  813 

ggplot2,  14,  16,  131,  132,  157,  164,  172, 

455, 648 

Gini,  311,  313,  335,  336,  342 
Glossary,  823 

Google,  383,  388-394,  396-398,  416,  491, 
492,  494,  658,  697-700,  773,  784,  817 
GPU,  513,  553,  765,  775,  782,  804,  805,  823 
Graph,  14,  40,  47,  70,  75,  77,  164,  166,  198, 
244,  287,  297,  305,  356,  376,  386,  391, 
393,  399,  430,  431,  443,  448,  489, 
528-533,  555,  562,  563,  570,  613,  626, 
628,  649,  650,  658,  676,  775,  784 
Graphical  user  interfaces  (GUIs),  15-16,  823 


H 

Handwritten  digits,  795,  799,  801 
HC,  135,  705,  824 
Heatmap,  134,  150-152 
Help,  16 

Heterogeneity,  11,  311 

Hidden,  135,  386,  391,  393,  394,  398, 

416,  660,  765-767,  772,  774,  775, 
781,785,  799 

Hierarchical  clustering,  443,  467-469,  727 
High-throughput  big  data  analytics,  10 
hist,  39,  83,  144 

hist(x)  histogram  of  the  frequencies  of  x,  39 
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Histogram,  39,  46,  51,  68,  71-74,  87,  140,  143, 
144,  146,  174,  180,  198,  222,  249,  250, 
353,  356,  634,  792 

Horizontal,  39,  45,  46,  70,  151,  152,  159,  230, 
356, 368 

Hospital,  346,  347,  513,  655,  656 
Howard  Hughes  Medical  Institute  (HHMI), 

5,  823 

I 

IBS,  789-792 

if  TRUE  superposes  the  plot  on  the  previous 
one  (if  it  exists),  41 
Image,  20,  24,  40,  83,  84,  176-178, 

403,404,  660,  781,795,796, 

799, 801, 806-816 
Image  classification,  817 
image(x,  y,  z)  plotting  actual  data  with 
colors,  40 

Independent  component  analysis  (ICA),  233, 
242, 243,  250-254,  265 
Index,  187,  313,  316,  388,  389,  392,  416,  513, 
625,  641 

Inference,  1,  13,  201,  282,  289,  513,  573,  638, 
655,659,  735,819 

Input/output  (I/O),  22-24,  64,  765,  823 
interaction. plot  (fl,  f2,  y)  if  fl  and  f2  are 
factors,  plots  the  means  of  y  (on  the 
y-axis)  with  respect  to  the  values  of  fl 
(on  the  x-axis)  and  of  f2  (different 
curves).  The  option  fun  allows  to  choose 
the  summary  statistic  of  y  (by  default 
fun=mean),  40 
Interpolate,  48 
Intersect,  30 

Inverse  document  frequency  (IDF),  659, 
676-686,  695,  823 
Iris,  63,  64,  308-310,  409-411, 

414, 727 


J 

Java,  10,  13,  20,  72,  332,  334,  349,  534 
Jitter,  143,  157 

JSON,  198,  513,  514,  522,  525-526, 
531,533,823 


K 

k-Means  Clustering  (k-MC),  443 
k-nearest  neighbor  (kNN),  268,  269,  447 
Knockoff,  574,  621 


L 

Lagrange,  401,  402,  735,  740-741,  749, 
753-756,  762 
Lake  Mapourika,  810-811 
Lattice,  46,  47 

Layer,  386,  388,  394,  765-768,  770,  771, 
773-775,  781,  782,  785,  799-801 
Lazy  learning,  267,  286-287 
Length,  5,  19,  21,  26,  28,  35,  37,  40,  46,  47,  63, 
64,  132,  174,  230,  231,  235,  270,  273, 
346,  374,  377,  409,  480 
Letters,  148,  193,  195,  215,  404,  530,  664 
Linear  algebra,  201,  229-231,  345 
Linear  mixed  models,  623 
Linear  model,  574-582,  621,  650 
Linear  programming,  735,  748 
list  (),  18-20 

lm  (),  16,  225,  358,  553,  824 
log,  30,  31,  40,  313,  517,  587,  610,  611,  615, 
616,  640,  716 
Log-linear,  40 

Long,  5,  13,  18,  36,  514,  547,  565,  676, 

784, 819 

Longitudinal  data,  40,  657-658 
Lowess,  824 


M 

Machine  learning,  2,  10,  267,  268,  289,  322, 
383,  423,  443-444,  476,  477,  481,  497, 
536,  549,  562,  659,  660,  667,  689,  765, 
809,  816,  820 

Managing  data,  63,  140-141 

Mask,  815-816 

matplot(x,  y)  bivariate  plot  of  the  first  column  of 
x  vs.  the  first  one  of  y,  the  second  one  of 
x  vs.  the  second  one  of  y,  etc,  40 

Matrices,  20,  21,  24,  31,  149,  167,  201-203, 
206-209,  213-216,  219,  220,  222,  229, 
230,  233,  258,  478-480,  490,  549,  574, 
640,  641,  645,  650,  667,  672,  698,  714, 
716,  735,  782,  804 

Matrix,  13,  21,  26,  28,  31,  32,  40,  46,  47,  81, 
132,  149-151,  153,  161-163,  166, 

167,  174,  201-209,  211,  212,  214-217, 
219-222,  224,  225,  227-231,  235,  236, 
238-240,  242,  244,  245,  247,  251, 
254-258,  260,  265,  295,  299,  300,  304, 
305,  319,  322,  324-325,  350,  351,  356, 
391,  427,  429-432,  450,  463,  478,  480, 
483,  484,  501,  506,  507,  528-530,  537, 
540,  552,  555,  574,  582,  607,  608,  620, 
639-641,  648,  650,  654,  655,  660, 
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667-668,  670-674,  676,  685,  688,  689, 
695,  702,  716,  717,  727,  735,  739,  747, 
748,  753,  766,  767,  782,  799-801 
Matrix  computing,  201,  229-231,  345 
Michigan  Institute  for  Data  Science 
(MIDAS),  824 

Mild  cognitive  impairment  (MCI),  4,  149, 

151,  824 

Misclassification,  311,  324-325,  411,  418 
mlbench,  536,  774 
mlp,  774,  775,781,782,  785 
Model,  2,  10,  13,  47-48,  81,  93,  110,  120,  166, 
201,  216,  217,  227,  230,  246,  252,  253, 
260,  262,  267,  268,  274-276,  283, 

286,  299-301,  345,  350,  356,  358-375, 
377-381,  383,  385-387,  391-394,  397, 
398,  405-409,  411-416,  418,  479, 

488,  489,  510,  511,  571,  572,  658, 

733. 734. 817 

Model  performance,  268,  274-276,  300-301, 
322-323,  333,  359-373,  377-380,  386, 
392-394,  406-409,  412-414,  433-438, 
451-454,  462-465,  475,  479,  480,  487, 
488,  491,  492,  494,  495,  497,  501-503, 
507,  564-569,  572,  605,  697,  698,  701 
Model-based,  2,  10,  345,  481,  566,  573,  660, 
710, 819,  821 

Model-free,  2,  10,  481,  660,  689,  705,  819,  821 
Modeling,  1,  4,  9,  13,  48,  83,  201,  216-217, 
233,  259,  307,  347,  349,  505,  513,  528, 
582,  638,  640,  659,  668,  701,  703,  756, 
775, 820, 824 
MOOCs,  821 
mosaicplot,  40 

mosaicplot(x)  “mosaic”  graph  of  the  residuals 
from  a  log-linear  regression  of  a 
contingency  table,  40 
Multi-scale,  623 
Multi-source,  9,  514,  559 
MXNet,  774,  775,  782,  785,  799-801,  804, 

805. 817 


N 

NA,  22,  24,  28,  30,  38,  67,  69,  155,  287,  380, 
427,  429,  538,  625 
na.omit,  28,  48 
na.omit(x),  28 

Naive  Bayes,  289,  290,  299,  302-305,  476 
Natural  language  processing  (NLP),  442, 
659-668,  689-691,  694-695,  824 
Nearest  neighbors,  267,  286-287,  719-720 
Negative  AND  (NAND),  771-772,  824 


Network,  383,  384,  386,  398,  533,  555,  730, 
731,773,  799-800,  804-806 
Neural  networks,  383-388,  498,  510,  717-718, 
765,766,  816-817 
Neurodegeneration,  4-5 
Neuroimaging,  4,  7,  588-590,  608-621,  789, 
817,  823 

New  Zealand,  810-811 
Next  Generation  Sequence  (NGS),  6-7,  824 
Next  Generation  Sequence  (NGS)  Analysis, 
6-7 

Nodes,  164,  293,  307,  311,  316,  321,  336,  374, 
376,  379,  383,  386,  391,  393,  394, 

416,  524,  528-530,  532,  765,  766, 

768,  775,  785 

Non-linear  optimization,  752-753,  762 
Normal  controls  (NC),  4,  149,  151,  152,  167, 
169-171,  824 

Numeric,  2,  19,  25,  46,  47,  66,  68,  71,  76,  77, 
145,  149,  150,  212,  259,  273,  274,  299, 
319,  370-371,  377,  396,  409,  503, 

559,  570 


O 

Objective  function,  242,  250,  251,  401,  558, 
573,  574,  579,  587,  592,  640,  641, 
735-738,  740,  741,  747-749,  753, 
754,  756-758 
Open-science,  1,  819,  821 
Optical  character  recognition  (OCR),  383, 
403-408,  795,  824 
Optimization,  13,  47,  243,  254,  401, 
402,513,  546,  573,  574,  579, 

587,  592,  641,  735-753,  755, 

756,  761,  762 

Optimize,  47,  337,  401,  739,  757, 

758,819 


P 

Package,  30,  38,  46,  63,  78,  81,  131,  132,  138, 
149,  157,  164,  167,  172,  174,  208,  247, 
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239,  311,  356,  357,  371,  424-427,  529, 
532,  770,  796 

pairs(x)  if  x  is  a  matrix  or  a  data  frame,  draws  all 
possible  bivariate  plots  between  the 
columns  of  x,  40 

Parallel  computing,  548-553,  555-556 
Parkinson’s  disease  (PD),  51,  135,  261,  262,
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Parkinson’s  Progression  Markers  Initiative
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642,  656,  705,  711,  719,  824
Perceptron,  766,  769,  773,  775,  785 
Perl,  13 

persp(x,  y,  z)  plotting  actual  data  in  perspective 
view,  40 

Petal,  39,  64,  727 

Pie,  39,  143,  147,  149,  167,  170,  198 

pie(x)  circular  pie-chart,  39
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629-631,  727,  736,  739,  742,  757,  768, 
776,  778,  779,  792 

plot(x)  plot  of  the  values  of  x  (on  the  y-axis) 
ordered  on  the  x-axis,  39 
plot(x,  y)  bivariate  plot  of  x  (on  the  x-axis)  and 
y  (on  the  y-axis),  39 

plot.ts(x)  if  x  is  an  object  of  class  “ts”,  plot  of  x
with  respect  to  time;  x  may  be
multivariate  but  the  series  must  have  the
same  frequency  and  dates.  Detailed
examples  are  in  Chap.  19
(big  longitudinal  data  analysis),  40
Predict,  3,  4,  9,  10,  48,  81,  267,  283,  300,  322, 
334,  346,  377-380,  389,  391,  392,  411, 
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582,  584-586,  600,  602,  623,  674,  679, 
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Pruning,  307,  315,  316,  328,  330 
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Python,  Java,  C/C++,  Perl,  and  many  others,  13 
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QOL,  317 
qqnorm,  40 

qqnorm(x)  quantiles  of  x  with  respect  to  the
values  expected  under  a  normal  law,  40
qqplot,  40 

qqplot(x,  y)  quantiles  of  y  with  respect  to  the 
quantiles  of  x,  40 
Quadratic  programming,  824 
Quality  of  life,  490,  792 
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Regularized  linear  model,  621 
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349,  350,  369,  383,  394,  448,  454,  532, 
581,  584,  640,  817,  823
rep(),  18 

Require,  12,  400,  409,  432,  527,  550,  555,  563, 
772,  815,  819 
reshape2,  14,  16 

Risk  for  Alcohol  Withdrawal  Syndrome 
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Root  mean  square  error  (RMSE),  329,  477,  565, 
698,  701,  703,  784,  817,  824 
RStudio,  13,  15-16 
RStudio  GUI,  15-16 
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Scatter  plot 

scatter,  46,  153,  226,  230,  231 
Sensitivity,  305,  485,  486,  714,  734,  776 
seq(),  18,  38,  69
Sequencing,  6 

set.seed,  37,  49,  287,  333,  492,  499,  782,  800
setdiff,  30 
setequal,  30 

Silhouette,  443,  446,  451,  452,  456-459,  463, 
464,  469,  477,  723,  725 
sin,  cos,  tan,  asin,  acos,  atan,  atan2,  log,  log10,
exp  and  “set”  functions  union(x,  y),
intersect(x,  y),  setdiff(x,  y),
setequal(x,  y),  is.element(el,  set)  are  available
in  R,  30

Singular  value  decomposition  (SVD),  233,  241, 
242,  256-258,  265,  824 
Size,  16,  30,  46,  47,  49,  132,  135,  145,  154,
174,  192,  209,  210,  269,  315,  316,  323,
328,  336,  348,  390,  425,  426,  429,  450,
451,  495,  498,  500,  503,  510,  515,
534-536,  565,  566,  572,  592,  624,
676,  747,  767,  773,  774,  781,  784,
795,  819,  823
sMRI,  4,  178 

softmax,  498,  767,  774,  781,  800 
Sonar,  774-781 
Sort,  28,  700 

Specificity,  305,  485-486,  734,  776 
Spectra,  231 

Splitting,  268,  307,  311,  315,  373,  374,  536, 
584, 686 

SQL,  138-139,  513,  515-521,  537,  553,  824 
Stacked,  39,  46,  196 

stars(x)  if  x  is  a  matrix  or  a  data  frame,  draws  a 
graph  with  segments  or  a  star  where  each 
row  of  x  is  represented  by  a  star 
and  the  columns  are  the  lengths 
of  the  segments,  40 
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(SOCR),  4,  10,  11,  50,  51,  56,  72,  79, 
130,  140,  147,  171,  173,  178,  187,  193, 
198,  230,  258,  305,  342,  349,  522,  524, 
525,  531-533,  540-543,  555,  569,  584, 
669, 817,  824 
stripplot,  39,  46 


stripplot(x)  plot  of  the  values  of  x  on  a  line 
(an  alternative  to  boxplot()  for  small 
sample  sizes),  39 

Structural  equation  modeling  (SEM),  623, 
638-648,  824 

Summary  statistic,  35,  40,  67,  76,  140,  187, 
352,  549 

sunflowerplot(x,  y)  same  as  plot()  but  the  points
with  similar  coordinates  are  drawn  as
flowers  whose  petal  number  represents
the  number  of  points,  39
Support  vector  machines  (SVM),  398-403 
Surface,  132,  141,  174-176,  814-815 
Symbol,  47,  404,  490,  781,  799 
symbols(x,  y,  ...)  draws,  at  the  coordinates
given  by  x  and  y,  symbols  (circles,
squares,  rectangles,  stars,  thermometers
or  “boxplots”)  whose  sizes,  colors,  ...
are  specified  by  supplementary
arguments,  40
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Table,  13,  22-24,  29,  30,  32,  35,  38,  40,  76,  78, 
79,  140,  144,  148,  166,  208,  268,  274, 
275,  282,  292,  300,  301,  311,  317,  322, 
412,  426,  450,  463,  477-480,  482,  483, 
486,  501,  504,  511,  529,  530,  548,  555, 
614,  641,  686,  771
TensorFlow,  765,  773,  784 
Term  frequency  (TF),  659,  676-686,  695 
termplot(mod.obj)  plot  of  the  (partial)  effects  of 
a  regression  model  (mod.obj),  40 
Testing,  7,  268,  274,  282,  287,  299,  303,  318, 
324,  342,  396,  414,  491,  505,  579,  581, 
584,  599,  600,  639,  648,  679,  684,  686, 
690,  691,  697,  701,  703,  704,  719,  765, 
775, 782,  784,  795,  799-801 
Text  mining  (TM),  442,  659-668,  689-691, 
694-695,  824 

The  following  parameters  are  common  to  many 
plotting  functions,  40 

Then,  try  to  perform  a  multiple  classes  (i.e.,  AD,
NC  and  MCI)  classification  and  report
the  results,  816

Training,  8,  141,  260,  267-270,  274,  281,  287, 
289,  292,  295-297,  299-300,  303,  304, 
311,  318-321,  332-333,  337,  358-359, 
374,  375,  380,  390,  391,  395,  396,  398, 
410-412,  416,  418,  432-433,  450-451, 
461,  491,  493,  495,  501,  503-505,  507, 
553,  554,  558-564,  579,  584,  599,  600, 
679,  684,  686,  688,  697,  701-704,  715, 
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Transdisciplinary,  9,  819,  820 
Trauma,  163,  443,  459-467 
ts,  1,  31,  40,  47,  77,  80,  533,  629,  631 
ts.plot(x)  same  as  plot.ts(x)  but  if  x  is  multivariate  the  series
may  have  different  dates  and  must  have
the  same  frequency,  40
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571,  590,  592,  626,  648,  726
Volcano,  175,  812-813 

W 

which.max,  27 
Whole-genome,  6 
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630,  634,  637-639,  642,  643,  648,  653, 
655,  656,  663,  665,  668,  670,  672,  674, 
676-678,  684,  686,  688,  689,  698,  700, 
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